There are two ways to run Nutch. One is with a standard Nutch install on each node (which bundles Hadoop). Run that way, the conf directory sits at the front of the classpath, so the crawl-urlfilter.txt file is pulled from the classpath, and hence from the conf directory on each node.
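For that first setup, updating the filter just means pushing the edited file into each node's conf directory. A rough sketch, assuming a four-node cluster with Nutch installed in the same location on every node (hostnames and paths are illustrative):

  for host in node1 node2 node3 node4; do
    # copy the updated filter into the conf directory on each node
    scp conf/crawl-urlfilter.txt $host:/path/to/nutch/conf/
  done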

The other way is to run the nutch.job file on a standard Hadoop cluster using the hadoop jar nutch.job command. With this approach you would simply package the crawl-urlfilter.txt file in the nutch.job and it would get deployed to all nodes. Remember this is with a Hadoop cluster, not a Nutch cluster, which alters a Hadoop install with different elements on the classpath.
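A rough sketch of that workflow, assuming a stock Nutch source tree where ant's job target rebuilds the .job file and org.apache.nutch.crawl.Crawl is the entry point (adjust paths, the main class, and arguments for your own build):

  # rebuild the job file so the edited conf/crawl-urlfilter.txt is bundled in
  cd $NUTCH_HOME
  ant job

  # or patch an existing job file in place with the updated filter
  jar uf nutch.job crawl-urlfilter.txt

  # submit to the Hadoop cluster; the filter ships inside the job file
  hadoop jar nutch.job org.apache.nutch.crawl.Crawl urls -dir crawl -depth 3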

Dennis


Robert Goodman wrote:
I'm running Nutch on a 4 node Hadoop cluster. I'm trying to understand how
to best update the filters in files like crawl-urlfilter.txt before running
a Nutch job on the Hadoop cluster. It looks like there are two possible
options.

1. Manually copy the file to each Hadoop node
2. Looking at the source, it looks like a URL could be specified in the
site.xml file, which would allow the file to be downloaded from a web server.


In my experimenting with Nutch on the cluster I couldn't find a way to get
Nutch to pass information in the crawl-urlfilter.txt to the other nodes in
the cluster. Is there a better way to handle this problem when running Nutch
on a Hadoop cluster?

    Thanks
     Bob
