On Fri, Aug 8, 2008 at 1:18 PM, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> brainstorm wrote:
>>
>> I was wondering... if I split the input urls like this:
>>
>> url1.txt url2.txt ... urlN.txt
>>
>> will this input spread map jobs to N nodes? Right now I'm using just
>
> No, it won't - because these files are first added to the crawldb, and only
> then the Generator creates partial fetchlists out of the whole crawldb.
>
> Here's how it works:
>
> * Generator first prepares the list of candidate urls for fetching
>
> * then it applies limits, e.g. the maximum number of urls per host
>
> * and finally it partitions the fetchlist so that all urls from the same host
> end up in the same partition. The number of output partitions from Generator
> is equal to the default number of map tasks. Why? Because Fetcher will
> create one map task per partition in the fetchlist.
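
Thanks, that clears things up. If I understand the partitioning step
correctly, it behaves roughly like this toy, standalone sketch (my own code
just to convince myself, not the actual Nutch partitioner; the class name
and urls are made up): urls are hashed on their host, so all urls from one
host get the same partition number, and the number of partitions matches
the number of fetch map tasks.

// HostPartitionDemo.java -- toy illustration only, NOT the real Nutch code:
// urls are assigned to partitions by hashing their host, so every url from
// the same host lands in the same partition (and thus the same fetch task).
import java.net.URL;

public class HostPartitionDemo {

    // Same host -> same hash -> same partition number.
    static int partitionFor(String url, int numPartitions) throws Exception {
        String host = new URL(url).getHost();
        return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) throws Exception {
        int numPartitions = 7;  // e.g. one fetch map task per node on a 7-node cluster
        String[] urls = {
            "http://example.com/page1.html",
            "http://example.com/page2.html",
            "http://another-host.org/index.html"
        };
        for (String u : urls) {
            System.out.println("partition " + partitionFor(u, numPartitions) + "  " + u);
        }
    }
}

Of course the real Generator does more than this (per-host limits, scoring,
etc.); the sketch is just to check that I have the partitioning idea right.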
Somebody said that 2 mapred.map.tasks was fine for a 7-node cluster setup,
but using greater values for mapred.map.tasks (tested from 2 up to 256) does
not change the output or fix the problem: no additional part-NNNNN files are
generated for each map, and no additional nodes participate in fetching :/
What should I do?

> So - please check how many part-NNNNN files you have in the generated
> fetchlist.

This is one example crawled segment:

/user/hadoop/crawl-dmoz/segments/20080806192122/content/part-00000

As you see, just one part-NNNNN file is generated... In the conf file
(nutch-site.xml) mapred.map.tasks is set to 2 (the default value, as
suggested in previous emails).

Thanks for your support! ;)

> --
> Best regards,
> Andrzej Bialecki <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  || |   Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
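
PS: For reference, this is the relevant snippet from my nutch-site.xml
(currently back at the default value of 2; I only changed the number while
testing):

<!-- nutch-site.xml: requested number of map tasks per job; if I read the
     Generator code right, this is also used as its default number of
     fetchlist partitions -->
<property>
  <name>mapred.map.tasks</name>
  <value>2</value>
</property>

I also noticed that the generate command seems to accept a -numFetchers
argument (something like: bin/nutch generate crawl-dmoz/crawldb
crawl-dmoz/segments -numFetchers 7), so I will try that as well, in case it
is the knob that actually controls the number of part-NNNNN files.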
