Doug Cutting wrote:
> Shawn Gervais wrote:
>> When I perform a crawl large enough to observe the fetch process for
>> an extended period of time (1M pages over 16 nodes, in this case), I
>> notice there is one map task which performs _very_ poorly compared to
>> the others:
>
> My suspicion is that you're trying to fetch a large number of pages from
> a single site. Fetch tasks are partitioned by host name. All URLs with
> a given host are fetched in a single fetcher map task. Grep the errors
> from the log on the slow node: I'll bet most are from a single host name.
>
> To fix this, try setting generate.max.per.host. A good value might be
> something like topN/(mapred.map.tasks*fetcher.threads.fetch). So if
> you're setting -topN to 10M and running with 10 fetch tasks and 100
> threads each, then each fetch task will fetch around 1M URLs, or 10,000
> per thread. Fetching from a single host is single-threaded, so any host
> with more than 10,000 URLs will slow the overall fetch.
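[Editor's note: Doug's sizing rule above can be sketched as a few lines of arithmetic. The numbers are the example figures from his message (topN of 10M, 10 fetch tasks, 100 threads each), not measurements.]

```python
# Sketch of the per-host cap suggested in the thread:
# generate.max.per.host ~= topN / (mapred.map.tasks * fetcher.threads.fetch)
top_n = 10_000_000        # -topN passed to the generator
map_tasks = 10            # mapred.map.tasks (number of fetch map tasks)
threads_per_task = 100    # fetcher.threads.fetch

urls_per_task = top_n // map_tasks                    # ~1M URLs per fetch task
urls_per_thread = urls_per_task // threads_per_task   # ~10,000 URLs per thread

# A single host is fetched by a single thread, so any host with more
# URLs than this keeps one thread busy past the end of the fetch.
max_per_host = top_n // (map_tasks * threads_per_task)
print(max_per_host)  # 10000
```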
Doug,
Thanks for the tip! You were indeed correct that the errant thread was
fetching pages from a handful of domains (cnn and geocities).
Setting generate.max.per.host has yielded more consistent performance
across all my fetcher tasks.
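[Editor's note: for anyone following along, generate.max.per.host is set in conf/nutch-site.xml using the standard Hadoop property format. The value 10000 below is Doug's worked example, not a universal recommendation; size it for your own topN, task count, and thread count.]

```xml
<!-- conf/nutch-site.xml: cap the number of URLs selected per host at
     generate time. Roughly topN / (mapred.map.tasks * fetcher.threads.fetch). -->
<property>
  <name>generate.max.per.host</name>
  <value>10000</value>
</property>
```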
Now to figure out why a lone reduce task always dies on large fetches :/
-Shawn
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general