There are two ways to run Nutch. One is with a standard Nutch install
on each node (which contains Hadoop). Run that way, the conf directory
sits at the beginning of the classpath, so the crawl-urlfilter.txt file is
pulled from the classpath, hence from the conf directory on each node.
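For illustration, the kind of rules that file carries looks roughly like this
(the domain is just a placeholder; use your own):

    # accept URLs inside the domain you want to crawl
    +^http://([a-z0-9]*\.)*example.com/
    # reject everything else
    -.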
The other way is to run the nutch.job file on a standard Hadoop cluster
using the hadoop jar nutch.job command. With this approach you would
simply package the crawl-urlfilter.txt file inside nutch.job, and it
would get deployed to all nodes along with the job. Remember, this is with a
Hadoop cluster, not a Nutch cluster, which alters a Hadoop install with
different elements in the classpath.
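A rough sketch of that second approach; the paths, the location of the config
files at the root of the job jar, and the Crawl arguments are assumptions you
would adjust to your own layout:

    # edit the filter, then refresh the copy packaged inside the job jar
    cd $NUTCH_HOME/conf
    vi crawl-urlfilter.txt
    jar uf ../build/nutch.job crawl-urlfilter.txt
    # submit the job; Hadoop ships the jar, filter included, to every node
    cd ..
    bin/hadoop jar build/nutch.job org.apache.nutch.crawl.Crawl urls -dir crawl -depth 3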
Dennis
Robert Goodman wrote:
I'm running Nutch on a 4 node Hadoop cluster. I'm trying to understand how
to best update the filters in files like crawl-urlfilter.txt before running
a Nutch job on the Hadoop cluster. It looks like there are two possible
options.
1. Manually copy the file to each Hadoop node
2. Looking at the source, it appears a URL can be specified in the
site.xml file, which would allow the file to be downloaded from a web server.
In my experimenting with Nutch on the cluster, I couldn't find a way to get
Nutch to pass the information in crawl-urlfilter.txt to the other nodes in
the cluster. Is there a better way to handle this problem when running Nutch
on a Hadoop cluster?
Thanks
Bob