Thanks, Dennis. I was wondering what the nutch.job file was for. I will give it a try.
Bob

On Wed, Nov 12, 2008 at 10:50 AM, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> There are two ways to run Nutch. One is with a standard Nutch install on
> each node (which contains Hadoop). Run that way, the conf directory is at
> the beginning of the classpath, and the crawl-urlfilter.txt file is pulled
> from the classpath, hence from the conf directory on each node.
>
> The other way is to run the nutch.job file on a standard Hadoop cluster
> using the hadoop jar nutch.job command. With this approach you would
> simply package the crawl-urlfilter.txt file in the nutch.job and it would
> get deployed to all nodes. Remember, this is with a Hadoop cluster, not a
> Nutch cluster, which alters a Hadoop install with different elements in
> the classpath.
>
> Dennis
>
> Robert Goodman wrote:
>> I'm running Nutch on a 4-node Hadoop cluster. I'm trying to understand
>> the best way to update the filters in files like crawl-urlfilter.txt
>> before running a Nutch job on the Hadoop cluster. It looks like there
>> are two possible options:
>>
>> 1. Manually copy the file to each Hadoop node.
>> 2. Looking at the source, it looks like a URL could be specified in the
>> site.xml file, which would allow the file to be downloaded from a web
>> server.
>>
>> In my experimenting with Nutch on the cluster, I couldn't find a way to
>> get Nutch to pass the information in crawl-urlfilter.txt to the other
>> nodes in the cluster. Is there a better way to handle this problem when
>> running Nutch on a Hadoop cluster?
>>
>> Thanks
>> Bob

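[Editor's note] For Bob's first option (manually copying the file to each node of a standard Nutch install, where conf/ sits at the head of the classpath), a small push script can be generated like this. The hostnames and the /opt/nutch path are placeholders, not values from the thread:

```shell
set -e
# Hypothetical node list for the 4-node cluster described above.
nodes="node1 node2 node3 node4"

# Emit one scp command per node into a reviewable deploy script rather than
# copying immediately, so the target paths can be checked first.
printf '#!/bin/sh\n' > push-filter.sh
for n in $nodes; do
  printf 'scp conf/crawl-urlfilter.txt %s:/opt/nutch/conf/\n' "$n" >> push-filter.sh
done
chmod +x push-filter.sh

cat push-filter.sh
```

Running the generated script (`./push-filter.sh`) would overwrite the filter in each node's conf directory; the job-file approach above avoids this per-node step entirely.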