[EMAIL PROTECTED] wrote:
Hi,
Is there any tutorial on running Nutch on a few machines? And how do I
turn off downloading and caching of URL content?
There is an option in nutch-default.xml (which you should copy to your
nutch-site.xml and override): fetcher.store.content. If it's set to
false, the fetcher will not store the content - however, you should
make sure that you run the Fetcher in parsing mode (fetcher.parse set to
true). If you run it in non-parsing mode and attempt to run the parse
tool as a separate step, you won't have any content to parse ;)
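For reference, a minimal nutch-site.xml carrying just these two
overrides could look like the sketch below (the property names come
from nutch-default.xml, the rest is the usual boilerplate):

  <?xml version="1.0"?>
  <configuration>

    <!-- do not keep the raw page content in the segments -->
    <property>
      <name>fetcher.store.content</name>
      <value>false</value>
    </property>

    <!-- parse while fetching, since there will be no stored
         content left to parse in a separate step -->
    <property>
      <name>fetcher.parse</name>
      <value>true</value>
    </property>

  </configuration>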
Regarding the issue of running Nutch on a few machines: the default
Nutch configuration is good for development and for casual crawling, and
it works even in a distributed setup with a NameNode and a JobTracker.
However, the procedure to set up a cluster and deploy Nutch artifacts is
poorly explained IMHO, so it would be nice to create a better tutorial
for this.
Here's my experience with this: I strongly recommend that users first
set up a clean Hadoop cluster, using for example a binary Hadoop
release of the same version as the Hadoop JAR in the Nutch lib/
directory. At the moment this would be hadoop-0.15.0.tgz. You should
follow the Hadoop tutorials to complete this step, and run some sample
jobs to verify that the cluster works properly.
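For example (this is just a smoke test using the examples jar bundled
with the Hadoop release, nothing Nutch-specific):

  # format HDFS and start the daemons, as described in the Hadoop tutorial
  bin/hadoop namenode -format
  bin/start-all.sh

  # run one of the bundled example jobs to verify the cluster
  bin/hadoop jar hadoop-0.15.0-examples.jar pi 10 100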
Then you should build a Nutch job file that does NOT contain any Hadoop
artifacts, i.e. without lib/hadoop-*.jar or conf/hadoop* files. Patches
are welcome to automate this from within our build.xml.
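One way to do this today (just a sketch - the exact paths inside the
job file may differ in your checkout) is to build the job file as usual
and then strip the Hadoop pieces out of the archive:

  # build the job file with the normal target
  ant job

  # a .job file is just a zip archive, so the Hadoop jar and config
  # files can be deleted from it afterwards
  zip -d build/nutch-*.job 'lib/hadoop-*.jar' 'hadoop-default.xml' 'hadoop-site.xml'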
Finally, you can start Nutch jobs by using bin/hadoop jar nutch-*.job
<className> <args ...>. You can also copy the bin/nutch script to your
Hadoop bin/ directory and edit it to point to the right location of the
Nutch job file - this way you will be able to use the bin/nutch <tool>
shortcuts.
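As an illustration (the paths and the <segment> name below are
placeholders, not fixed names), the first steps of a crawl could then
be run like this:

  # inject the seed URLs into a new crawldb
  bin/hadoop jar nutch-*.job org.apache.nutch.crawl.Injector crawl/crawldb urls

  # generate a fetch list and fetch it
  bin/hadoop jar nutch-*.job org.apache.nutch.crawl.Generator crawl/crawldb crawl/segments
  bin/hadoop jar nutch-*.job org.apache.nutch.fetcher.Fetcher crawl/segments/<segment>

With the edited bin/nutch script in place, the same steps become
bin/nutch inject, bin/nutch generate and bin/nutch fetch.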
You could ask why use such a complicated procedure ... The problem with
the default setup is that if you simply copy the whole Nutch environment
to each node on the cluster, and then start the Hadoop daemons from this
environment, the daemons will include Nutch classes and resources on
their classpath. Any classes or resources supplied in a job jar will
then come after these on the classpath. This in turn means that if
there are several versions of the same resource - one inside the job
jar and another somewhere in build/classes or in conf/ - the map-reduce
tasks will use only the resource that comes first on the classpath,
which is most likely the stale one lying around on each node, and not
the updated one inside the job jar. You can see how this becomes a
problem if you want to keep the Hadoop cluster running all the time, and
only occasionally update the job jar.
--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  || |   Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com