Doğacan Güney wrote:
2008/9/19 Edward Quick <[EMAIL PROTECTED]>:
Also forgot to mention, what should mapred.map.tasks and mapred.reduce.tasks be set to?


I haven't run the fetcher in distributed mode for a while, but back then the
fetcher would run as many map tasks as there are parts under crawl_generate,
so maybe this has changed. Anyway, try setting mapred.map.tasks to 3 as well
for fetching.
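
(For reference, a rough sketch only - not code from Nutch itself: those two
properties correspond to the following calls on the old
org.apache.hadoop.mapred API, assuming you are building the JobConf yourself.
Note the map count is only a hint to the framework; the reduce count is
binding.)

import org.apache.hadoop.mapred.JobConf;

public class TaskCountSketch {
  public static void main(String[] args) {
    JobConf job = new JobConf();
    // Equivalent of mapred.map.tasks / mapred.reduce.tasks.
    job.setNumMapTasks(3);     // only a hint; splits decide the real count
    job.setNumReduceTasks(3);  // binding
    System.out.println(job.get("mapred.map.tasks"));     // 3
    System.out.println(job.get("mapred.reduce.tasks"));  // 3
  }
}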



It didn't change, and that's not the issue here. Look at the size of the parts:


-bash-3.00$ bin/hadoop dfs -ls crawl/segments/20080918173443/crawl_generate
Found 3 items
/user/nutch/crawl/segments/20080918173443/crawl_generate/part-00000  <r 1>  86      2008-09-18 17:35  rw-r--r--  nutch  supergroup
/user/nutch/crawl/segments/20080918173443/crawl_generate/part-00001  <r 1>  86      2008-09-18 17:35  rw-r--r--  nutch  supergroup
/user/nutch/crawl/segments/20080918173443/crawl_generate/part-00002  <r 1>  442915  2008-09-18 17:35  rw-r--r--  nutch  supergroup

The "problem" is that parts 0 and 1 contain no data (the 86 bytes is consumed by a header of an empty SequenceFile). It's not really a problem as such - this is most likely caused by a skewed distribution of urls among hosts, i.e. all the urls on this fetchlist come from a single or very few hosts (which accidentally are hashed to the same partition).

Then, when you start the fetcher, it may create 3 tasks, but 2 of them finish immediately (no input data), and all the remaining urls are handled by a single task.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
