[EMAIL PROTECTED] wrote:
Hi,

Is there any tutorial on running Nutch on a few machines? And how do I
turn off downloading and caching of URL content?

There is an option in nutch-default.xml (which you should copy to your nutch-site.xml and change its value there): fetcher.store.content. If it's false, then the fetcher will not store the content - however, you should make sure that you run the Fetcher in parsing mode (fetcher.parse set to true). If you run it in non-parsing mode and attempt to run the parse tool as a separate step, then you won't have any content to parse ;)
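As a minimal sketch, a nutch-site.xml overriding these two properties could look something like this (the property names are the ones mentioned above; the surrounding XML is just the standard Hadoop-style configuration format):

  <?xml version="1.0"?>
  <!-- nutch-site.xml: local overrides of nutch-default.xml -->
  <configuration>
    <!-- don't store fetched page content in the segments -->
    <property>
      <name>fetcher.store.content</name>
      <value>false</value>
    </property>
    <!-- parse while fetching, since there will be no stored content to parse later -->
    <property>
      <name>fetcher.parse</name>
      <value>true</value>
    </property>
  </configuration>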

Regarding the issue of running Nutch on a few machines: the default Nutch configuration is good for development and for casual crawling, and it works even in a distributed setup with a NameNode and a JobTracker. However, the procedure to set up a cluster and deploy the Nutch artifacts is poorly explained IMHO, so it would be nice to create a better tutorial for this.

Here's my experience with this: I strongly recommend that users first set up a clean Hadoop cluster, using for example a binary Hadoop release of the same version as the Hadoop JAR in Nutch's lib/ directory. At the moment this would be hadoop-0.15.0.tgz. You should follow the Hadoop tutorials to complete this step, and run some sample jobs to verify that the cluster works properly.
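As a quick smoke test of the cluster you could run one of the example jobs bundled with the Hadoop release, e.g. (the exact examples jar name is an assumption based on the 0.15.0 release layout - check your own distribution):

  # run a trivial map-reduce job to verify that HDFS and the JobTracker work
  bin/hadoop jar hadoop-0.15.0-examples.jar pi 4 1000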

Then you should build a Nutch job file that does NOT contain any Hadoop artifacts, i.e. without the lib/hadoop-*.jar or conf/hadoop* files. Patches to automate this from within our build.xml are welcome.
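Until then, one way to do it by hand is a sketch like the following - it assumes the .job file is a plain zip archive with the lib/ jars and the conf files packaged at its root, so please verify the entry names against your own build output:

  ant job
  # strip Hadoop jars and Hadoop config files from the job archive
  zip -d build/nutch-*.job 'lib/hadoop-*.jar' 'hadoop-default.xml' 'hadoop-site.xml'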

Finally, you can start Nutch jobs by using bin/hadoop jar nutch-*.job <className> <args ...>. You can also copy the bin/nutch script to your Hadoop bin/ directory and edit it to point to the right location of the Nutch job file - this way you will be able to use the bin/nutch <tool> shortcuts.
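For example, injecting a seed list could look like this (the job file name and the crawldb/urls paths are just placeholders for illustration):

  # run the Injector tool from the job file on the Hadoop cluster
  bin/hadoop jar nutch-0.9.job org.apache.nutch.crawl.Injector crawldb urls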

You may ask why such a complicated procedure is needed. The problem with the default setup is that if you simply copy the whole Nutch environment to each node in the cluster, and then start the Hadoop daemons from this environment, the daemons will include the Nutch classes and resources on their classpath. This means that any classes or resources supplied in a job jar will come after these resources. This in turn means that if there are several versions of the same resource - one inside the job jar and another somewhere in build/classes or in conf/ - the map-reduce tasks will use only the resource that comes first on the classpath, which is most likely the one lying around on each node, and not the updated one inside the job jar. You can see how this becomes a problem if you want to keep the Hadoop cluster running all the time and only occasionally update the job jar.


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
