Hello, I am wondering how others deal with the following, which I see as a fetching inefficiency:
When fetching, the fetchlist is split into multiple parts, and fetchers on the cluster nodes start fetching them. Some fetchers end up fetching from fast servers, and some from very slow servers. The ones fetching from slow servers take a long time to complete and prolong the whole fetch. For instance, I've seen tasks from the same fetch job finish in only 1-2 hours, and others in 10 hours. The ones taking 10 hours were stuck fetching pages from a single slow site or a handful of them.

If you have two nodes doing the fetching and one is stuck with a slow server, the other one finishes early and then sits idle, wasting time. The node stuck with the slow server is also underutilized, as it's slowly fetching from only 1 server instead of many.

I imagine anyone using Nutch is seeing the same thing. If not, what's the trick?

I have not tried overlapping fetch jobs yet, but I have a feeling that won't help much, plus it could lead to two fetchers fetching from the same server and being impolite. Am I wrong?

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
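
P.S. To put rough numbers on the problem, here is a small back-of-the-envelope sketch. It is not Nutch code: the function name `job_stats` and the one-task-per-node model are my own simplifying assumptions, and the 2-hour and 10-hour figures are just the task times mentioned above.

```python
# Hypothetical illustration, not part of Nutch: models a fetch job where
# each node runs exactly one fetch task and the job ends with the slowest task.

def job_stats(task_hours):
    """Return (wall_clock_hours, total_idle_hours, utilization) for a fetch job."""
    wall_clock = max(task_hours)  # the job is only done when the slowest task is done
    idle = sum(wall_clock - t for t in task_hours)  # hours nodes spend waiting
    utilization = sum(task_hours) / (wall_clock * len(task_hours))
    return wall_clock, idle, utilization


# Two nodes: one task finishes in 2 hours, the other is stuck for 10 hours.
wall, idle, util = job_stats([2, 10])
print(f"job takes {wall}h, nodes idle for {idle}h total, utilization {util:.0%}")
# prints: job takes 10h, nodes idle for 8h total, utilization 60%
```

Even this overstates utilization, since it counts the slow node as fully busy for all 10 hours while it is really waiting on a single server most of the time.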
