I face a similar problem. I occasionally have fetch jobs that fetch from fewer than 100 hosts, and the effect is magnified in that case.

I have found one workaround for this, but I am not sure it is the best possible solution: I set generate.max.per.host to a fairly small value (such as 1000), which reduces the maximum amount of time any task can be held up by a particular host. This does increase the number of cycles needed to finish a crawl, but it does solve the problem you describe. It might even make sense to use a lower value -- I am still experimenting to find a good one myself.

In addition, I think NUTCH-629 and NUTCH-570 could help reduce the effects of slow servers.

Best,
Siddhartha Reddy

On Tue, Apr 22, 2008 at 1:46 AM, <[EMAIL PROTECTED]> wrote:

> Hello,
>
> I am wondering how others deal with the following, which I see as fetching
> inefficiency:
>
> When fetching, the fetchlist is broken up into multiple parts and fetchers
> on cluster nodes start fetching. Some fetchers end up fetching from fast
> servers, and some from very very slow servers. Those fetching from slow
> servers take a long time to complete and prolong the whole fetching process.
> For instance, I've seen tasks from the same fetch job finish in only 1-2
> hours, and others in 10 hours. Those taking 10 hours were stuck fetching
> pages from a single or handful of slow sites. If you have two nodes doing
> the fetching and one is stuck with a slow server, the other one is idling
> and wasting time. The node stuck with the slow server is also
> underutilized, as it's slowly fetching from only 1 server instead of many.
>
> I imagine anyone using Nutch is seeing the same. If not, what's the
> trick?
>
> I have not tried overlapping fetching jobs yet, but I have a feeling that
> won't help a ton, plus it could lead to two fetchers fetching from the same
> server and being impolite - am I wrong?
>
> Thanks,
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

-- 
http://sids.in
"If you are not having fun, you are not doing it right."
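
For reference, a minimal sketch of the generate.max.per.host override mentioned in the reply above. It assumes the property is set in conf/nutch-site.xml, the usual place for overriding values from nutch-default.xml. The value 1000 is only the example figure from the message, not a tuned recommendation, and the description text is illustrative wording rather than the shipped one.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml (assumed location for site-specific overrides) -->
<configuration>
  <property>
    <name>generate.max.per.host</name>
    <!-- Example value from the message above; a lower cap shortens the
         worst-case wait on a slow host but needs more generate/fetch cycles. -->
    <value>1000</value>
    <description>Maximum number of URLs per host in a single fetchlist.</description>
  </property>
</configuration>
```

With a cap like this, a slow host can contribute at most that many URLs to any one fetchlist, so the remaining URLs for that host are picked up in later generate/fetch cycles.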
