> Doğacan Güney wrote:
> > 2008/9/19 Edward Quick <[EMAIL PROTECTED]>:
> >> Also forgot to mention, what should mapred.map.tasks and
> >> mapred.reduce.tasks be set to?
> >>
> >
> > I haven't run fetcher in distributed mode for a while, but back then,
> > fetcher would run as many map tasks as there are
> > parts under crawl_generate. So, maybe this has changed. Anyway, try
> > setting mapred.map.tasks to 3 as well for fetching.
> >
>
> It didn't change, and that's not the issue here. Look at the size of the
> parts:
>
> >> -bash-3.00$ bin/hadoop dfs -ls crawl/segments/20080918173443/crawl_generate
> >> Found 3 items
> >> /user/nutch/crawl/segments/20080918173443/crawl_generate/part-00000 <r 1>  86      2008-09-18 17:35 rw-r--r-- nutch supergroup
> >> /user/nutch/crawl/segments/20080918173443/crawl_generate/part-00001 <r 1>  86      2008-09-18 17:35 rw-r--r-- nutch supergroup
> >> /user/nutch/crawl/segments/20080918173443/crawl_generate/part-00002 <r 1>  442915  2008-09-18 17:35 rw-r--r-- nutch supergroup
>
> The "problem" is that parts 0 and 1 contain no data (the 86 bytes are
> consumed by the header of an empty SequenceFile). It's not really a
> problem as such - it is most likely caused by a skewed distribution of
> urls among hosts, i.e. all the urls on this fetchlist come from a single
> host or very few hosts (which happen to be hashed to the same partition).
>
> Then, when you start the fetcher, it may create 3 tasks, two of which
> finish their job immediately (no input data), and all the remaining urls
> are handled by just a single task.
Thanks Andrzej. Any ideas how to fix this so that the urls are shared equally between the 3 hosts? There is only one domain (our Intranet) that I need to crawl.
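
[Editor's note: the skew Andrzej describes follows from partitioning the fetchlist by the hash of the url's host. The snippet below is only a rough, hypothetical Java sketch of that idea (class and method names are invented, this is not the actual Nutch partitioner code), included to show why a single-host intranet crawl lands entirely in one part file.]

    // Hypothetical sketch of hash-by-host partitioning (not Nutch source code).
    import java.net.URL;

    public class HostPartitionSketch {

        // Returns the partition (0..numPartitions-1) a url would land in
        // when the partition key is the hash of its host name.
        static int partitionFor(String url, int numPartitions) throws Exception {
            String host = new URL(url).getHost();
            return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }

        public static void main(String[] args) throws Exception {
            // All urls share one intranet host, so they all hash to the same
            // partition and the other two crawl_generate parts stay empty.
            String[] urls = {
                "http://intranet.example.com/page1",
                "http://intranet.example.com/page2",
                "http://intranet.example.com/dir/page3"
            };
            for (String u : urls) {
                System.out.println(u + " -> part " + partitionFor(u, 3));
            }
        }
    }

[Because every url comes from the same host, the sketch prints the same partition number for all of them, which matches the listing above: two 86-byte empty parts and one part holding everything.]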
