Doğacan Güney wrote:
2008/9/19 Edward Quick <[EMAIL PROTECTED]>:
Also forgot to mention, what should mapred.map.tasks and mapred.reduce.tasks be set to?


I haven't run the fetcher in distributed mode for a while, but back then the
fetcher would run as many map tasks as there are parts under crawl_generate,
so maybe this has changed. Anyway, try setting mapred.map.tasks to 3 as well
for fetching.
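
(For reference, a rough sketch only - not code from Nutch itself: those two
properties correspond to the following calls on the old
org.apache.hadoop.mapred API, assuming you are building the JobConf yourself.
Note the map count is only a hint to the framework; the reduce count is
binding.)

import org.apache.hadoop.mapred.JobConf;

public class TaskCountSketch {
  public static void main(String[] args) {
    JobConf job = new JobConf();
    // Equivalent of mapred.map.tasks / mapred.reduce.tasks.
    job.setNumMapTasks(3);     // only a hint; splits decide the real count
    job.setNumReduceTasks(3);  // binding
    System.out.println(job.get("mapred.map.tasks"));     // 3
    System.out.println(job.get("mapred.reduce.tasks"));  // 3
  }
}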



It didn't change, and that's not the issue here. Look at the size of the parts:


-bash-3.00$ bin/hadoop dfs -ls crawl/segments/20080918173443/crawl_generate
Found 3 items
/user/nutch/crawl/segments/20080918173443/crawl_generate/part-00000  <r 1>  86      2008-09-18 17:35  rw-r--r--  nutch  supergroup
/user/nutch/crawl/segments/20080918173443/crawl_generate/part-00001  <r 1>  86      2008-09-18 17:35  rw-r--r--  nutch  supergroup
/user/nutch/crawl/segments/20080918173443/crawl_generate/part-00002  <r 1>  442915  2008-09-18 17:35  rw-r--r--  nutch  supergroup

The "problem" is that parts 0 and 1 contain no data (the 86 bytes is consumed by a header of an empty SequenceFile). It's not really a problem as such - this is most likely caused by a skewed distribution of urls among hosts, i.e. all the urls on this fetchlist come from a single or very few hosts (which accidentally are hashed to the same partition).

Then, when you start the fetcher, it may create 3 tasks, but 2 of them finish immediately (no input data), and all the remaining urls are handled by a single task.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
