Shawn Gervais wrote:
When I perform a search large enough to observe the fetch process for an extended period of time (1M pages over 16 nodes, in this case), I notice there is one map task which performs _very_ poorly compared to the others:

4905 pages, 33094 errors, 3.5 pages/s, 432 kb/s,
    versus
46639 pages, 13227 errors, 43.9 pages/s, 4547 kb/s,

It is deficient in terms of raw pages/sec, execution time (it is the last map task to complete), and the number of errors encountered.

As I said, there seems to always be exactly one map task like this. Different fetch executions will have the thread assigned to different machines -- there doesn't seem to be any pattern.

What the heck is going on here?

My suspicion is that you're trying to fetch a large number of pages from a single site. Fetch tasks are partitioned by host name: all URLs with a given host are fetched in a single fetcher map task. Grep the errors from the log on the slow node; I'll bet most are from a single host name.
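One rough way to do that check is to tally hosts across the error lines. The sketch below assumes error lines contain the word "failed" and the full URL; the actual log format on your nodes may differ, and count_error_hosts and the sample lines are made up for illustration.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Matches anything that looks like an http(s) URL inside a log line.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")


def count_error_hosts(lines, error_marker="failed"):
    """Count host names in log lines that contain error_marker.

    error_marker is an assumption about the log format; adjust it to
    whatever your fetcher actually writes for a failed fetch.
    """
    counts = Counter()
    for line in lines:
        if error_marker not in line:
            continue
        for url in URL_RE.findall(line):
            try:
                host = urlsplit(url).hostname
            except ValueError:
                # Malformed URL (e.g. broken IPv6 literal); skip it.
                continue
            if host:
                counts[host] += 1
    return counts


# Hypothetical sample lines, not taken from a real log.
sample_log = [
    "fetch of http://big.example.com/a failed with: timeout",
    "fetch of http://big.example.com/b failed with: timeout",
    "fetching http://small.example.org/index.html",
    "fetch of http://other.example.net/x failed with: 404",
    "fetch of http://big.example.com/c failed with: timeout",
]

top_hosts = count_error_hosts(sample_log).most_common(2)
for host, errors in top_hosts:
    print(host, errors)
```

If one host accounts for most of the errors, that host is the likely cause of the slow task.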

To fix this, try setting generate.max.per.host.

A good value might be something like topN/(mapred.map.tasks*fetcher.threads.fetch). So if you're setting -topN to 10M and running 10 fetch tasks with 100 threads each, then each fetch task will fetch around 1M URLs, or 10,000 per thread. Fetching a single host is single-threaded, so any host with more than 10,000 URLs will slow the overall fetch.
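The arithmetic, as a small sketch (the function and variable names are just for illustration; only the formula and the numbers come from the example above):

```python
def max_urls_per_host(top_n, map_tasks, threads_per_task):
    """Rule of thumb: topN / (mapred.map.tasks * fetcher.threads.fetch)."""
    return top_n // (map_tasks * threads_per_task)


top_n = 10_000_000       # -topN
map_tasks = 10           # mapred.map.tasks
threads_per_task = 100   # fetcher.threads.fetch

urls_per_task = top_n // map_tasks
per_host_limit = max_urls_per_host(top_n, map_tasks, threads_per_task)

print(urls_per_task)   # URLs handled by each fetch task
print(per_host_limit)  # candidate value for generate.max.per.host
```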

Here's another way to think about it: if you're fetching one page per second per host (fetcher.server.delay) and your fetch tasks are averaging around an hour (3600 seconds), then any host which has more than 3600 pages will cause its fetch task to run slower than the others and/or to have a high error rate.
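The same estimate in code, as a sketch that assumes one host is fetched strictly sequentially with a fixed delay between requests and ignores download time (the function names are hypothetical):

```python
def host_fetch_seconds(pages, server_delay_s=1.0):
    """Lower bound on the time to fetch `pages` URLs from one host."""
    return pages * server_delay_s


def max_pages_within(task_seconds, server_delay_s=1.0):
    """Largest per-host page count that fits in a task of task_seconds."""
    return int(task_seconds / server_delay_s)


typical_task_s = 3600  # average fetch task duration from the example
budget = max_pages_within(typical_task_s)
big_host_s = host_fetch_seconds(10_000)

print(budget)             # pages per host that fit in a typical task
print(big_host_s / 3600)  # hours needed for a 10,000-page host
```

A host above the budget keeps its task running after the other tasks have finished.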

Doug

