brainstorm wrote:
I was wondering... if I split the input urls like this:
url1.txt url2.txt ... urlN.txt
Will this input spread the map jobs to N nodes? Right now I'm using just
No, it won't - these files are first added to the crawldb (by the inject
step), and only then does Generator create partial fetchlists out of the
whole crawldb.
Here's how it works:
* Generator first prepares the list of candidate urls for fetching,
* then it applies limits, e.g. the maximum number of urls per host,
* and finally it partitions the fetchlist so that all urls from the same
host end up in the same partition (see the sketch below). The number of
output partitions from Generator is equal to the default number of map
tasks. Why? Because Fetcher will create one map task for each partition
in the fetchlist.
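
The partitioning boils down to "hash of the host, modulo the number of
partitions". Here's a minimal sketch in the style of the old Hadoop
mapred API (this is not Nutch's actual partitioner class, just an
illustration of the idea):

  import java.net.MalformedURLException;
  import java.net.URL;

  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.Writable;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.Partitioner;

  public class HostPartitioner implements Partitioner<Text, Writable> {
    public void configure(JobConf job) {}

    public int getPartition(Text url, Writable value, int numPartitions) {
      String host;
      try {
        host = new URL(url.toString()).getHost();  // group urls by host
      } catch (MalformedURLException e) {
        host = url.toString();                     // fall back to the raw key
      }
      // same host -> same hash -> same partition -> same fetcher map task
      return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

This is also why splitting the input into url1.txt ... urlN.txt makes no
difference: the partitioning is decided by the hosts in the crawldb, not
by the number of input files.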
So - please check how many part-NNNNN files you have in the generated
fetchlist.
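
For example, a generated segment with two partitions would look roughly
like this (the segment name below is just an example):

  crawl/segments/20060815123456/crawl_generate/part-00000
  crawl/segments/20060815123456/crawl_generate/part-00001

Two part files means Fetcher will run two map tasks. If you need more,
raise the number of map tasks (mapred.map.tasks) in your Hadoop config,
or - if your Nutch version supports it - pass -numFetchers to the
generate command.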
--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com