Doug Cutting wrote:
Shawn Gervais wrote:
When I run a fetch large enough to observe the process for an extended period of time (1M pages over 16 nodes, in this case), I notice there is one map task which performs _very_ poorly compared to the others:


My suspicion is that you're trying to fetch a large number of pages from a single site. Fetch tasks are partitioned by host name, so all URLs with a given host are fetched in a single fetcher map task. Grep for errors in the log on the slow node; I'll bet most are from a single host name.

To fix this, try setting generate.max.per.host.
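As a sketch, that property would go in conf/nutch-site.xml; the value 10000 below is only an illustrative placeholder, not a recommendation:

```xml
<!-- Cap the number of URLs selected per host in each generated fetch list.
     10000 is an example value; size it for your own crawl. -->
<property>
  <name>generate.max.per.host</name>
  <value>10000</value>
</property>
```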

A good value might be something like topN/(mapred.map.tasks*fetcher.threads.fetch). So if you're setting -topN to 10M and running 10 fetch tasks with 100 threads each, then each fetch task will fetch around 1M URLs, or 10,000 per thread. Fetching from a single host is single-threaded, so any host with more than 10,000 URLs will slow the overall fetch.
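The sizing arithmetic above can be sketched in a few lines; this is an illustrative back-of-the-envelope calculation using the numbers from this thread, and the class and method names are hypothetical, not part of Nutch:

```java
// Sketch of the per-host cap estimate:
//   generate.max.per.host ~= topN / (mapred.map.tasks * fetcher.threads.fetch)
public class MaxPerHostEstimate {

    // topN: total URLs generated; mapTasks: number of fetch map tasks;
    // threadsPerTask: fetcher.threads.fetch
    static long maxPerHost(long topN, int mapTasks, int threadsPerTask) {
        return topN / ((long) mapTasks * threadsPerTask);
    }

    public static void main(String[] args) {
        // Thread's example: topN = 10M, 10 fetch tasks, 100 threads each
        long cap = maxPerHost(10_000_000L, 10, 100);
        System.out.println("suggested generate.max.per.host = " + cap); // 10000
    }
}
```

Any host holding more URLs than this cap would otherwise pin one thread for longer than the rest of the task takes to finish.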

Doug,

Thanks for the tip! You were indeed correct that the errant thread was fetching pages from a handful of domains (cnn and geocities).

Setting generate.max.per.host has yielded more consistent performance across all my fetcher tasks.

Now to figure out why a lone reduce task always dies on large fetches :/

-Shawn


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
