> Doğacan Güney wrote:
> > 2008/9/19 Edward Quick <[EMAIL PROTECTED]>:
> >> Also forgot to mention, what should mapred.map.tasks and
> >> mapred.reduce.tasks be set to?
> >>
> >
> > I haven't run fetcher in distributed mode for a while, but back then,
> > fetcher would run as many map tasks as there are
> > parts under crawl_generate. So, maybe this has changed. Anyway, try
> > setting mapred.map.tasks to 3 as well for fetching.
>
>
>
> It didn't change, and that's not the issue here. Look at the size of the
> parts:
>
>
> >> -bash-3.00$ bin/hadoop dfs -ls crawl/segments/20080918173443/crawl_generate
> >> Found 3 items
> >> /user/nutch/crawl/segments/20080918173443/crawl_generate/part-00000 <r 1> 86 2008-09-18 17:35 rw-r--r-- nutch supergroup
> >> /user/nutch/crawl/segments/20080918173443/crawl_generate/part-00001 <r 1> 86 2008-09-18 17:35 rw-r--r-- nutch supergroup
> >> /user/nutch/crawl/segments/20080918173443/crawl_generate/part-00002 <r 1> 442915 2008-09-18 17:35 rw-r--r-- nutch supergroup
>
> The "problem" is that parts 0 and 1 contain no data (the 86 bytes is
> consumed by a header of an empty SequenceFile). It's not really a
> problem as such - this is most likely caused by a skewed distribution of
> urls among hosts, i.e. all the urls on this fetchlist come from a single
> or very few hosts (which accidentally are hashed to the same partition).
>
> Then, when you start the fetcher, it may create 3 tasks, and 2 of them
> finish their job immediately (no input data), and all the remaining urls
> are handled just by a single task.
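
To make the skew described above concrete, here is a rough sketch of host-based partitioning. It is not Nutch's actual partitioner: the function names, the CRC32 hash, and the intranet.example.com host are all assumptions chosen for illustration. It only shows why urls from a single host all land in one partition when the partition is derived from the host name alone.

```python
import zlib
from collections import Counter
from urllib.parse import urlparse


def partition_for(url, num_partitions):
    """Pick a partition from the url's host only (hypothetical helper)."""
    host = urlparse(url).hostname or ""
    # CRC32 is an assumption here; any deterministic hash of the host
    # gives the same effect: one host always maps to one partition.
    return zlib.crc32(host.encode("utf-8")) % num_partitions


def partition_sizes(urls, num_partitions):
    """Count how many urls fall into each partition."""
    counts = Counter(partition_for(u, num_partitions) for u in urls)
    return [counts.get(i, 0) for i in range(num_partitions)]


# A fetchlist where every url comes from one (made-up) intranet host.
urls = ["http://intranet.example.com/page%d" % i for i in range(1000)]
sizes = partition_sizes(urls, 3)
print(sizes)  # one partition holds all 1000 urls, the other two are empty
```

With three partitions and a single host, two parts stay empty and one part carries the whole fetchlist, which matches the listing above (two 86-byte parts and one large part).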
> Thanks Andrzej. Any ideas how to fix this so that the urls are
> distributed equally among the 3 hosts?
> There is only one domain (our Intranet) which I need to crawl.
Ahh, I see this was already discussed in a recent thread:
http://www.mail-archive.com/[email protected]/msg11812.html
So in conclusion, is this saying it's not possible to fetch from the same site
at the same time on multiple nodes, or is there a way to override that?
Thanks for all your help.
Ed