> Doğacan Güney wrote:
> > 2008/9/19 Edward Quick <[EMAIL PROTECTED]>:
> >> Also forgot to mention, what should mapred.map.tasks and
> >> mapred.reduce.tasks be set to?
> >>
> >
> > I haven't run fetcher in distributed mode for a while, but back then,
> > fetcher would run as many map tasks as there are
> > parts under crawl_generate. So, maybe this has changed. Anyway, try
> > setting mapred.map.tasks to 3 as well for fetching.
> >
>
> It didn't change, and that's not the issue here. Look at the size of the
> parts:
>
> >> -bash-3.00$ bin/hadoop dfs -ls crawl/segments/20080918173443/crawl_generate
> >> Found 3 items
> >> /user/nutch/crawl/segments/20080918173443/crawl_generate/part-00000 <r 1>  86      2008-09-18 17:35 rw-r--r-- nutch supergroup
> >> /user/nutch/crawl/segments/20080918173443/crawl_generate/part-00001 <r 1>  86      2008-09-18 17:35 rw-r--r-- nutch supergroup
> >> /user/nutch/crawl/segments/20080918173443/crawl_generate/part-00002 <r 1>  442915  2008-09-18 17:35 rw-r--r-- nutch supergroup
>
> The "problem" is that parts 0 and 1 contain no data (the 86 bytes are
> consumed by the header of an empty SequenceFile). It's not really a
> problem as such - it is most likely caused by a skewed distribution of
> urls among hosts, i.e. all the urls on this fetchlist come from a single
> host or very few hosts (which happen to be hashed to the same partition).
>
> Then, when you start the fetcher, it may create 3 tasks, two of which
> finish their job immediately (no input data), and all the remaining urls
> are handled by just a single task.
Thanks Andrzej. Any ideas how to fix this so that the urls are shared equally between the 3 hosts? There is only one domain (our Intranet) that I need to crawl.
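
[Editor's note: the skew Andrzej describes follows from partitioning the fetchlist by the hash of the url's host. The snippet below is only a rough, hypothetical Java sketch of that idea (class and method names are invented, this is not the actual Nutch partitioner code), included to show why a single-host intranet crawl lands entirely in one part file.]

    // Hypothetical sketch of hash-by-host partitioning (not Nutch source code).
    import java.net.URL;

    public class HostPartitionSketch {

        // Returns the partition (0..numPartitions-1) a url would land in
        // when the partition key is the hash of its host name.
        static int partitionFor(String url, int numPartitions) throws Exception {
            String host = new URL(url).getHost();
            return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }

        public static void main(String[] args) throws Exception {
            // All urls share one intranet host, so they all hash to the same
            // partition and the other two crawl_generate parts stay empty.
            String[] urls = {
                "http://intranet.example.com/page1",
                "http://intranet.example.com/page2",
                "http://intranet.example.com/dir/page3"
            };
            for (String u : urls) {
                System.out.println(u + " -> part " + partitionFor(u, 3));
            }
        }
    }

[Because every url comes from the same host, the sketch prints the same partition number for all of them, which matches the listing above: two 86-byte empty parts and one part holding everything.]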
