Thanks, Dennis. I was wondering what the nutch.job file was for. I will give it a try.
Bob

On Wed, Nov 12, 2008 at 10:50 AM, Dennis Kubes <[EMAIL PROTECTED]> wrote:
> There are two ways to run Nutch. One is with a standard Nutch install on
> each node (which contains Hadoop). Run that way, the conf directory is at
> the beginning of the classpath, and the crawl-urlfilter.txt file is pulled
> from the classpath, hence from the conf directory on each node.
>
> The other way is to run the nutch.job file on a standard Hadoop cluster
> using the hadoop jar nutch.job command. With this approach you would
> simply package the crawl-urlfilter.txt file in the nutch.job and it would
> get deployed to all nodes. Remember, this is with a Hadoop cluster, not a
> Nutch cluster, which alters a Hadoop install with different elements in
> the classpath.
>
> Dennis
>
> Robert Goodman wrote:
>> I'm running Nutch on a 4-node Hadoop cluster. I'm trying to understand
>> the best way to update the filters in files like crawl-urlfilter.txt
>> before running a Nutch job on the Hadoop cluster. It looks like there
>> are two possible options:
>>
>> 1. Manually copy the file to each Hadoop node.
>> 2. Looking at the source, it looks like a URL could be specified in the
>> site.xml file, which would allow the file to be downloaded from a web
>> server.
>>
>> In my experimenting with Nutch on the cluster, I couldn't find a way to
>> get Nutch to pass the information in crawl-urlfilter.txt to the other
>> nodes in the cluster. Is there a better way to handle this problem when
>> running Nutch on a Hadoop cluster?
>>
>> Thanks
>> Bob

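[Editor's note] For Bob's first option (manually copying the file to each node of a standard Nutch install, where conf/ sits at the head of the classpath), a small push script can be generated like this. The hostnames and the /opt/nutch path are placeholders, not values from the thread:

```shell
set -e
# Hypothetical node list for the 4-node cluster described above.
nodes="node1 node2 node3 node4"

# Emit one scp command per node into a reviewable deploy script rather than
# copying immediately, so the target paths can be checked first.
printf '#!/bin/sh\n' > push-filter.sh
for n in $nodes; do
  printf 'scp conf/crawl-urlfilter.txt %s:/opt/nutch/conf/\n' "$n" >> push-filter.sh
done
chmod +x push-filter.sh

cat push-filter.sh
```

Running the generated script (`./push-filter.sh`) would overwrite the filter in each node's conf directory; the job-file approach above avoids this per-node step entirely.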