Re: When Nutch fetches using mapred ...

Shawn Gervais Mon, 10 Apr 2006 17:02:09 -0700

Doug Cutting wrote:

Shawn Gervais wrote:
When I perform a search large enough to observe the fetch process for an extended period of time (1M pages over 16 nodes, in this case), I notice there is one map task which performs _very_ poorly compared to the others:

My suspicion is that you're trying to fetch a large number of pages from a single site. Fetch tasks are partitioned by host name. All urls with a given host are fetched in a single fetcher map task. Grep the errors from the log on the slow node: I'll bet most are from a single host name.
To fix this, try setting generate.max.per.host.
A good value might be something like topN/(mapred.map.tasks*fetcher.threads.fetch). So if you're setting -topN to 10M and running with 10 fetch tasks and using 100 threads, then each fetch task will fetch around 1M urls, 10,000 per thread. Fetching a single host is single-threaded, so any host with more than 10,000 urls will slow the overall fetch.
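
For anyone who wants to check the arithmetic above, here is a rough sketch in Python. It is not part of Nutch: the function names suggested_max_per_host and hosts_over_cap are made up for illustration, and the sketch assumes you have your candidate urls available as plain strings. It only applies the topN/(mapred.map.tasks*fetcher.threads.fetch) rule of thumb and counts urls per host to show which hosts would exceed the resulting cap.

```python
from collections import Counter
from urllib.parse import urlsplit


def suggested_max_per_host(top_n, map_tasks, threads_per_task):
    """Rule of thumb from the thread:
    topN / (mapred.map.tasks * fetcher.threads.fetch), rounded down.
    (Hypothetical helper, not a Nutch API.)"""
    return top_n // (map_tasks * threads_per_task)


def hosts_over_cap(urls, cap):
    """Return {host: url_count} for every host with more than `cap` urls.
    (Hypothetical helper, not a Nutch API.)"""
    counts = Counter(urlsplit(u).hostname for u in urls)
    return {host: n for host, n in counts.items() if n > cap}


# The example from the message: -topN 10M, 10 fetch tasks, 100 threads each.
cap = suggested_max_per_host(10_000_000, 10, 100)
print(cap)  # 10000
```

Any host that hosts_over_cap reports is a candidate for being the one that holds up a single fetch task, since all of its urls are fetched by one thread.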


Doug,

Thanks for the tip! You were indeed correct that the errant thread was fetching pages from a handful of domains (cnn and geocities).

Setting generate.max.per.host has yielded more consistent performance across all my fetcher tasks.


Now to figure out why a lone reduce task always dies on large fetches :/

-Shawn
