There are two ways to run Nutch. One is with a standard Nutch install
on each node (which contains Hadoop). Run that way, the conf directory
sits at the beginning of the classpath, so the crawl-urlfilter.txt file is
pulled from the classpath, hence from the conf directory on each node.
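For illustration, the kind of rules that file carries looks roughly like this
(the domain is just a placeholder; use your own):

    # accept URLs inside the domain you want to crawl
    +^http://([a-z0-9]*\.)*example.com/
    # reject everything else
    -.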
The other way is to run the nutch.job file on a standard Hadoop cluster
using the hadoop jar nutch.job command. With this approach you would
simply package the crawl-urlfilter.txt file inside nutch.job, and it
would get deployed to all nodes along with the job. Remember, this is with a
Hadoop cluster, not a Nutch cluster, which alters a Hadoop install with
different elements in the classpath.
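A rough sketch of that second approach; the paths, the location of the config
files at the root of the job jar, and the Crawl arguments are assumptions you
would adjust to your own layout:

    # edit the filter, then refresh the copy packaged inside the job jar
    cd $NUTCH_HOME/conf
    vi crawl-urlfilter.txt
    jar uf ../build/nutch.job crawl-urlfilter.txt
    # submit the job; Hadoop ships the jar, filter included, to every node
    cd ..
    bin/hadoop jar build/nutch.job org.apache.nutch.crawl.Crawl urls -dir crawl -depth 3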
Dennis
Robert Goodman wrote:
I'm running Nutch on a 4 node Hadoop cluster. I'm trying to understand how
to best update the filters in files like crawl-urlfilter.txt before running
a Nutch job on the Hadoop cluster. It looks like there are two possible
options.
1. Manually copy the file to each Hadoop node
2. Looking at the source, it appears a URL can be specified in the
site.xml file, which would allow the file to be downloaded from a web server.
In my experimenting with Nutch on the cluster, I couldn't find a way to get
Nutch to pass the information in crawl-urlfilter.txt to the other nodes in
the cluster. Is there a better way to handle this problem when running Nutch
on a Hadoop cluster?
Thanks
Bob