2008/9/19 Edward Quick <[EMAIL PROTECTED]>:
>
> Also forgot to mention, what should mapred.map.tasks and mapred.reduce.tasks 
> be set to?
>

I haven't run the fetcher in distributed mode for a while, but back then
it would run as many map tasks as there are parts under
crawl_generate, so that behaviour may have changed since. Anyway, try
setting mapred.map.tasks to 3 as well for fetching; I think that
should work.
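
In case it helps, here is a minimal sketch of how those two properties could
be set in conf/hadoop-site.xml (standard Hadoop property names; the value of
3 is only an assumption to match your 3-node cluster):

<!-- sketch only: tune the values to your cluster -->
<property>
  <name>mapred.map.tasks</name>
  <value>3</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>3</value>
</property>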

> Thanks,
>
> Ed.
>
> From: [EMAIL PROTECTED]
> To: [email protected]
> Subject: RE: running fetches in hadoop
> Date: Thu, 18 Sep 2008 19:36:45 +0000
>
>>
>> 2008/9/18 Edward Quick <[EMAIL PROTECTED]>:
>> >
>> > Thanks Doğacan,
>> >
>> > I set numFetchers, but I only see the fetch running on one host at a
>> > time, not on all of them at the same time.
>> > This is what I ran:
>> >
>> > -bash-3.00$ bin/nutch generate crawl/crawldb crawl/segments -numFetchers 3
>> > Generator: Selecting best-scoring urls due for fetch.
>> > Generator: starting
>> > Generator: segment: crawl/segments/20080918173443
>> > Generator: filtering: true
>> > Generator: Partitioning selected urls by host, for politeness.
>> > Generator: done.
>> > -bash-3.00$ bin/nutch fetch crawl/segments/20080918173443
>> > Fetcher: starting
>> > Fetcher: segment: crawl/segments/20080918173443
>> >
>>
>> Hmm, how many parts are under crawl/segments/20080918173443/crawl_generate?
>
> -bash-3.00$ bin/hadoop dfs -ls crawl/segments/20080918173443/crawl_generate
> Found 3 items
> /user/nutch/crawl/segments/20080918173443/crawl_generate/part-00000  <r 1>  86      2008-09-18 17:35  rw-r--r--  nutch  supergroup
> /user/nutch/crawl/segments/20080918173443/crawl_generate/part-00001  <r 1>  86      2008-09-18 17:35  rw-r--r--  nutch  supergroup
> /user/nutch/crawl/segments/20080918173443/crawl_generate/part-00002  <r 1>  442915  2008-09-18 17:35  rw-r--r--  nutch  supergroup
> -bash-3.00$
>
> This is what I have set in nutch-site.xml, bearing in mind I have 3 hosts:
> fetcher.server.delay 0.01
> fetcher.threads.fetch 10
> fetcher.threads.per.host 30
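
(For reference, a minimal sketch of how those three settings would look as
property entries in nutch-site.xml, using the values quoted above; these are
standard Nutch fetcher properties:)

<!-- sketch only: values taken from the settings listed above -->
<property>
  <name>fetcher.server.delay</name>
  <value>0.01</value>
</property>
<property>
  <name>fetcher.threads.fetch</name>
  <value>10</value>
</property>
<property>
  <name>fetcher.threads.per.host</name>
  <value>30</value>
</property>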
>
>>
>> >
>> >
>> >
>> >> Date: Thu, 18 Sep 2008 18:34:26 +0300
>> >> From: [EMAIL PROTECTED]
>> >> To: [email protected]
>> >> Subject: Re: running fetches in hadoop
>> >>
>> >> Hi,
>> >>
>> >> On Thu, Sep 18, 2008 at 5:23 PM, Edward Quick <[EMAIL PROTECTED]> wrote:
>> >> >
>> >> > I have 3 hosts in a hadoop cluster and noticed that the fetch only runs 
>> >> > from one host at a time.
>> >> > Is that right or should the fetch run from all 3 hosts at the same time?
>> >> >
>> >>
>> >> Try running generate like this:
>> >>
>> >> bin/nutch generate <other options> -numFetchers 3
>> >>
>> >> > Thanks,
>> >> >
>> >> > Ed.
>> >> >
>> >>
>> >> --
>> >> Doğacan Güney
>> >
>>
>> --
>> Doğacan Güney
>

-- 
Doğacan Güney
