[EMAIL PROTECTED] wrote:
Hi,

Is there any tutorial on running Nutch on a few machines? And how do I
turn off downloading and caching of URL content?

There is an option in nutch-default.xml (which you should copy to your nutch-site.xml and change its value there): fetcher.store.content. If it's false, then the fetcher will not store the content - however, you should make sure that you run the Fetcher in parsing mode (fetcher.parse set to true). If you run it in non-parsing mode and attempt to run the parse tool as a separate step, then you won't have any content to parse ;)
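As a minimal sketch, a nutch-site.xml overriding these two properties could look something like this (the property names are the ones mentioned above; the surrounding XML is just the standard Hadoop-style configuration format):

  <?xml version="1.0"?>
  <!-- nutch-site.xml: local overrides of nutch-default.xml -->
  <configuration>
    <!-- don't store fetched page content in the segments -->
    <property>
      <name>fetcher.store.content</name>
      <value>false</value>
    </property>
    <!-- parse while fetching, since there will be no stored content to parse later -->
    <property>
      <name>fetcher.parse</name>
      <value>true</value>
    </property>
  </configuration>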

Regarding the issue of running Nutch on a few machines: the default Nutch configuration is good for development and for casual crawling, and it works even in a distributed setup with a NameNode and a JobTracker. However, the procedure to set up a cluster and deploy the Nutch artifacts is poorly explained IMHO, so it would be nice to create a better tutorial for this.

Here's my experience with this: I strongly recommend that users first set up a clean Hadoop cluster, using for example a binary Hadoop release of the same version as the Hadoop JAR in Nutch's lib/ directory. At the moment this would be hadoop-0.15.0.tgz. You should follow the Hadoop tutorials to complete this step, and run some sample jobs to verify that the cluster works properly.
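As a quick smoke test of the cluster you could run one of the example jobs bundled with the Hadoop release, e.g. (the exact examples jar name is an assumption based on the 0.15.0 release layout - check your own distribution):

  # run a trivial map-reduce job to verify that HDFS and the JobTracker work
  bin/hadoop jar hadoop-0.15.0-examples.jar pi 4 1000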

Then you should build a Nutch job file that does NOT contain any Hadoop artifacts, i.e. without the lib/hadoop-*.jar or conf/hadoop* files. Patches to automate this from within our build.xml are welcome.
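Until then, one way to do it by hand is a sketch like the following - it assumes the .job file is a plain zip archive with the lib/ jars and the conf files packaged at its root, so please verify the entry names against your own build output:

  ant job
  # strip Hadoop jars and Hadoop config files from the job archive
  zip -d build/nutch-*.job 'lib/hadoop-*.jar' 'hadoop-default.xml' 'hadoop-site.xml'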

Finally, you can start Nutch jobs by using bin/hadoop jar nutch-*.job <className> <args ...>. You can also copy the bin/nutch script to your Hadoop bin/ directory and edit it to point to the right location of the Nutch job file - this way you will be able to use the bin/nutch <tool> shortcuts.
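For example, injecting a seed list could look like this (the job file name and the crawldb/urls paths are just placeholders for illustration):

  # run the Injector tool from the job file on the Hadoop cluster
  bin/hadoop jar nutch-0.9.job org.apache.nutch.crawl.Injector crawldb urls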

You may ask why such a complicated procedure is needed. The problem with the default setup is that if you simply copy the whole Nutch environment to each node in the cluster, and then start the Hadoop daemons from this environment, the daemons will include the Nutch classes and resources on their classpath. This means that any classes or resources supplied in a job jar will come after these resources. This in turn means that if there are several versions of the same resource - one inside the job jar and another somewhere in build/classes or in conf/ - the map-reduce tasks will use only the resource that comes first on the classpath, which is most likely the one lying around on each node, and not the updated one inside the job jar. You can see how this becomes a problem if you want to keep the Hadoop cluster running all the time and only occasionally update the job jar.


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
