[
https://issues.apache.org/jira/browse/NUTCH-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Julien Nioche resolved NUTCH-1821.
----------------------------------
Resolution: Not a Problem
Hi,
We got rid of the all-in-one Crawl class because it was opaque and hard to
modify, and having a script is more flexible. To give an example, the
modifications you suggest in (a), (b) and (c) are either already in the crawl
script or can easily be added to it; that wasn't the case with the class.
You can use the crawl script on EMR by SSHing to the master node, compiling
Nutch and starting your crawl as you would on any other Hadoop cluster. This
lets you keep an up-to-date version of Nutch while retaining the flexibility
of the script.
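As a rough sketch of that workflow (the host name, key file, Solr URL and seed/crawl paths below are placeholders, not values from this issue; the bin/crawl arguments follow the script's usage line and may differ slightly by version):

```shell
# Sketch only: host, key file, Solr URL and paths are placeholders.
# 1. SSH to the EMR master node (address shown in the EMR console).
ssh -i mykey.pem hadoop@ec2-xx-xx-xx-xx.compute-1.amazonaws.com

# 2. On the master node, build Nutch from source.
git clone https://github.com/apache/nutch.git
cd nutch
ant runtime

# 3. Run the crawl script in deploy mode, as on any Hadoop cluster:
#    bin/crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>
cd runtime/deploy
bin/crawl urls/ crawl/ http://localhost:8983/solr/ 2
```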
> Nutch Crawl class for EMR
> -------------------------
>
> Key: NUTCH-1821
> URL: https://issues.apache.org/jira/browse/NUTCH-1821
> Project: Nutch
> Issue Type: Wish
> Affects Versions: 1.6
> Environment: Amazon EMR
> Reporter: Luis Lopez
> Labels: Amazon, Crawler, EMR, performance
>
> Hi all,
> Some of us are using Amazon EMR to deploy/run Nutch, and from what I've been
> reading on the users mailing list there are two common issues people run
> into: first, EMR supports Hadoop only up to 1.0.3 (which is fairly old), and
> second, as of Nutch 1.8 the Crawl class has been removed.
> The first issue poses a problem when we try to deploy recent Nutch versions:
> the most recent version supported by EMR is 1.6. The second issue matters
> because EMR is given a jar and a main class to run, and from 1.8 the Crawl
> class no longer exists.
> After some tests we completed a branch (from Nutch 1.6 + Hadoop 1.0.3) that
> improves the old Crawl class so it scales. Since 1.6 is an old version, I
> wonder how we can contribute back to those who need to use Elastic MapReduce.
> The things we did are:
> a) Add num fetchers as a parameter to the Crawl class.
> For some reason the generator was always defaulting to a single fetch list
> (see:
> http://stackoverflow.com/questions/10264183/why-does-nutch-only-run-the-fetch-step-on-one-hadoop-node-when-the-cluster-has),
> which created just one fetch map task. With the new parameter we can adjust
> the number of map tasks to fit the cluster size.
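For illustration, the Generator already exposes a -numFetchers option on the command line; plumbing the same value through the Crawl entry point is essentially what (a) amounts to. The paths and node count below are made up:

```shell
# Sketch: paths are placeholders; NUM_NODES is an assumed cluster size.
# -numFetchers controls how many fetch lists the Generator partitions
# the URLs into, and hence how many fetch map tasks run in parallel.
NUM_NODES=10
bin/nutch generate crawl/crawldb crawl/segments -numFetchers $NUM_NODES
```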
> b) Index documents on each Crawl cycle and not at the end.
> We ran into performance/memory issues when we tried to index all the
> documents after the whole crawl was done, so we moved the indexing step into
> the main crawl cycle.
> c) We added an option to delete segments after their content is indexed into
> Solr. It saves HDF space since the EC2 instances we use don't have a lot of
> space.
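Taken together, (b) and (c) amount to a loop along these lines. This is a hedged sketch, not the branch's actual code: the command signatures approximate the Nutch 1.x CLI and may differ slightly by version, the Solr URL, paths and counts are placeholders, and hadoop fs -rmr is the Hadoop 1.0.x form of the recursive delete:

```shell
# Sketch only: paths, Solr URL and round/fetcher counts are placeholders.
SOLR_URL=http://localhost:8983/solr/
NUM_ROUNDS=5
NUM_NODES=10

for i in $(seq 1 $NUM_ROUNDS); do
  bin/nutch generate crawl/crawldb crawl/segments -numFetchers $NUM_NODES
  # Pick up the segment the generate step just created.
  SEGMENT=$(hadoop fs -ls crawl/segments | tail -1 | awk '{print $NF}')
  bin/nutch fetch $SEGMENT
  bin/nutch parse $SEGMENT
  bin/nutch updatedb crawl/crawldb $SEGMENT

  # (b) index this cycle's segment now, rather than everything at the end.
  bin/nutch solrindex $SOLR_URL crawl/crawldb -linkdb crawl/linkdb $SEGMENT

  # (c) drop the segment once indexed, to save HDFS space.
  hadoop fs -rmr $SEGMENT
done
```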
> So far these changes have allowed us to scale Nutch out and run efficiently
> on Amazon EMR clusters. If you think there is some value in these changes,
> we can submit a patch file.
> Luis.
--
This message was sent by Atlassian JIRA
(v6.2#6252)