Doug Cutting wrote:
> Shawn Gervais wrote:
>> When I perform a crawl large enough to observe the fetch process for
>> an extended period of time (1M pages over 16 nodes, in this case), I
>> notice there is one map task which performs _very_ poorly compared to
>> the others:
>
> My suspicion is that you're trying to fetch a large number of pages from
> a single site. Fetch tasks are partitioned by host name. All URLs with
> a given host are fetched in a single fetcher map task. Grep the errors
> from the log on the slow node: I'll bet most are from a single host name.
>
> To fix this, try setting generate.max.per.host. A good value might be
> something like topN/(mapred.map.tasks*fetcher.threads.fetch). So if
> you're setting -topN to 10M and running with 10 fetch tasks and 100
> threads each, then each fetch task will fetch around 1M URLs, or 10,000
> per thread. Fetching from a single host is single-threaded, so any host
> with more than 10,000 URLs will slow the overall fetch.
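[Editor's note: Doug's sizing rule above can be sketched as a few lines of arithmetic. The numbers are the example figures from his message (topN of 10M, 10 fetch tasks, 100 threads each), not measurements.]

```python
# Sketch of the per-host cap suggested in the thread:
# generate.max.per.host ~= topN / (mapred.map.tasks * fetcher.threads.fetch)
top_n = 10_000_000        # -topN passed to the generator
map_tasks = 10            # mapred.map.tasks (number of fetch map tasks)
threads_per_task = 100    # fetcher.threads.fetch

urls_per_task = top_n // map_tasks                    # ~1M URLs per fetch task
urls_per_thread = urls_per_task // threads_per_task   # ~10,000 URLs per thread

# A single host is fetched by a single thread, so any host with more
# URLs than this keeps one thread busy past the end of the fetch.
max_per_host = top_n // (map_tasks * threads_per_task)
print(max_per_host)  # 10000
```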
Doug,
Thanks for the tip! You were indeed correct that the errant thread was
fetching pages from a handful of domains (cnn and geocities).
Setting generate.max.per.host has yielded more consistent performance
across all my fetcher tasks.
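[Editor's note: for anyone following along, generate.max.per.host is set in conf/nutch-site.xml using the standard Hadoop property format. The value 10000 below is Doug's worked example, not a universal recommendation; size it for your own topN, task count, and thread count.]

```xml
<!-- conf/nutch-site.xml: cap the number of URLs selected per host at
     generate time. Roughly topN / (mapred.map.tasks * fetcher.threads.fetch). -->
<property>
  <name>generate.max.per.host</name>
  <value>10000</value>
</property>
```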
Now to figure out why a lone reduce task always dies on large fetches :/
-Shawn
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general