I'm running Nutch on a 4-node Hadoop cluster. I'm trying to understand how
best to update the filters in files like crawl-urlfilter.txt before running
a Nutch job on the cluster. It looks like there are two possible options:

1. Manually copy the file to each Hadoop node (a rough sketch of how I've
been doing this is below).
2. Looking at the source, it appears a URL could be specified in the
site.xml file, which would let the file be downloaded from a web server.
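
For reference, here is the rough helper I've been using for option 1 (the
hostnames and paths are just placeholders from my own setup, not anything
Nutch itself requires):

    #!/usr/bin/env python
    # Sketch of option 1: push the edited filter file to every node before
    # submitting the Nutch job. Node names and paths are examples only.
    import subprocess

    NODES = ["hadoop1", "hadoop2", "hadoop3", "hadoop4"]  # my 4 worker nodes
    FILTER_FILE = "conf/crawl-urlfilter.txt"              # locally edited copy
    REMOTE_CONF = "/opt/nutch/conf/"                      # Nutch conf dir on each node

    for node in NODES:
        # copy the filter file into the Nutch conf directory on the node
        subprocess.check_call(["scp", FILTER_FILE, "%s:%s" % (node, REMOTE_CONF)])

This works, but it feels like something the framework should handle for me.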


In my experiments with Nutch on the cluster, I couldn't find a way to get
Nutch to pass the information in crawl-urlfilter.txt to the other nodes in
the cluster. Is there a better way to handle this when running Nutch on a
Hadoop cluster?

    Thanks
     Bob
