Doug Cutting wrote:
Shawn Gervais wrote:
When I perform a search large enough to observe the fetch process for an extended period of time (1M pages over 16 nodes, in this case), I notice there is one map task which performs _very_ poorly compared to the others:


My suspicion is that you're trying to fetch a large number of pages from a single site. Fetch tasks are partitioned by host name. All urls with a given host are fetched in a single fetcher map task. Grep the errors from the log on the slow node: I'll bet most are from a single host name.
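
One way to run that check, as a rough sketch rather than anything Nutch ships: pull every URL out of the error lines and tally them by host. The log format is an assumption here (any line containing an http or https URL is counted), and the names URL_RE and count_hosts are made up for illustration.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Assumption: each error line of interest contains the failing URL somewhere in it.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")


def count_hosts(lines):
    """Tally how many URLs in the given log lines belong to each host."""
    counts = Counter()
    for line in lines:
        for url in URL_RE.findall(line):
            try:
                host = urlsplit(url).hostname
            except ValueError:
                # Skip URLs that the standard library cannot parse.
                continue
            if host:
                counts[host] += 1
    return counts
```

Feeding it the error lines from the slow node's log and looking at count_hosts(lines).most_common(10) should show whether one or two hosts account for most of the failures.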

To fix this, try setting generate.max.per.host.

A good value might be something like topN/(mapred.map.tasks*fetcher.threads.fetch). So if you're setting -topN to 10M and running with 10 fetch tasks and using 100 threads, then each fetch task will fetch around 1M urls, 10,000 per thread. Fetching a single host is single-threaded, so any host with more than 10,000 urls will slow the overall fetch.
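
The arithmetic above can be written out as a small helper. This is only a sketch of the rule of thumb, not part of Nutch; the function name max_urls_per_host and its parameter names are made up for illustration, and the result would be the value to try for generate.max.per.host.

```python
def max_urls_per_host(top_n: int, map_tasks: int, threads_per_task: int) -> int:
    """Rule of thumb: topN / (mapred.map.tasks * fetcher.threads.fetch).

    A single host is fetched by one thread, so a host with more URLs than
    one thread's share of the work will hold up its fetch task.
    """
    if top_n <= 0 or map_tasks <= 0 or threads_per_task <= 0:
        raise ValueError("all arguments must be positive")
    # Integer division; never return less than 1 URL per host.
    return max(1, top_n // (map_tasks * threads_per_task))
```

With the numbers from the example, max_urls_per_host(10_000_000, 10, 100) gives 10000, which matches the 10,000 urls per thread worked out above.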

Doug,

Thanks for the tip! You were indeed correct that the errant thread was fetching pages from a handful of domains (cnn and geocities).

Setting generate.max.per.host has yielded more consistent performance across all my fetcher tasks.

Now to figure out why a lone reduce task always dies on large fetches :/

-Shawn
