On Fri, Aug 8, 2008 at 1:18 PM, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> brainstorm wrote:
>>
>> I was wondering... if I split the input urls like this:
>>
>> url1.txt url2.txt ... urlN.txt
>>
>> will this input spread map jobs to N nodes? Right now I'm using just
>
> No, it won't - because these files are first added to the crawldb, and only
> then the Generator creates partial fetchlists out of the whole crawldb.
>
> Here's how it works:
>
> * Generator first prepares the list of candidate urls for fetching
>
> * then it applies limits, e.g. the maximum number of urls per host
>
> * and finally it partitions the fetchlist so that all urls from the same host
> end up in the same partition. The number of output partitions from Generator
> is equal to the default number of map tasks. Why? Because Fetcher will
> create one map task per partition in the fetchlist.
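
Thanks, that clears things up. If I understand the partitioning step
correctly, it behaves roughly like this toy, standalone sketch (my own code
just to convince myself, not the actual Nutch partitioner; the class name
and urls are made up): urls are hashed on their host, so all urls from one
host get the same partition number, and the number of partitions matches
the number of fetch map tasks.

// HostPartitionDemo.java -- toy illustration only, NOT the real Nutch code:
// urls are assigned to partitions by hashing their host, so every url from
// the same host lands in the same partition (and thus the same fetch task).
import java.net.URL;

public class HostPartitionDemo {

    // Same host -> same hash -> same partition number.
    static int partitionFor(String url, int numPartitions) throws Exception {
        String host = new URL(url).getHost();
        return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) throws Exception {
        int numPartitions = 7;  // e.g. one fetch map task per node on a 7-node cluster
        String[] urls = {
            "http://example.com/page1.html",
            "http://example.com/page2.html",
            "http://another-host.org/index.html"
        };
        for (String u : urls) {
            System.out.println("partition " + partitionFor(u, numPartitions) + "  " + u);
        }
    }
}

Of course the real Generator does more than this (per-host limits, scoring,
etc.); the sketch is just to check that I have the partitioning idea right.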
Somebody said that 2 mapred.map.tasks was fine for a 7-node cluster setup,
but using greater values for mapred.map.tasks (tested from 2 up to 256) does
not change the output or fix the problem: no additional part-NNNNN files are
generated for each map, and no additional nodes participate in fetching :/
What should I do?

> So - please check how many part-NNNNN files you have in the generated
> fetchlist.

This is one example crawled segment:

/user/hadoop/crawl-dmoz/segments/20080806192122/content/part-00000

As you see, just one part-NNNNN file is generated... In the conf file
(nutch-site.xml) mapred.map.tasks is set to 2 (the default value, as
suggested in previous emails).

Thanks for your support! ;)

> --
> Best regards,
> Andrzej Bialecki <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  || |   Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
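
PS: For reference, this is the relevant snippet from my nutch-site.xml
(currently back at the default value of 2; I only changed the number while
testing):

<!-- nutch-site.xml: requested number of map tasks per job; if I read the
     Generator code right, this is also used as its default number of
     fetchlist partitions -->
<property>
  <name>mapred.map.tasks</name>
  <value>2</value>
</property>

I also noticed that the generate command seems to accept a -numFetchers
argument (something like: bin/nutch generate crawl-dmoz/crawldb
crawl-dmoz/segments -numFetchers 7), so I will try that as well, in case it
is the knob that actually controls the number of part-NNNNN files.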
