Hello,

I am wondering how others deal with the following, which I see as a fetching 
inefficiency:


When fetching, the fetchlist is broken up into multiple parts and fetchers on 
cluster nodes start fetching.  Some fetchers end up fetching from fast servers, 
and some from very slow servers.  Those fetching from slow servers take a 
long time to complete and prolong the whole fetching process.  For instance, 
I've seen tasks from the same fetch job finish in only 1-2 hours, and others in 
10 hours.  Those taking 10 hours were stuck fetching pages from a single slow 
site or a handful of them.  If you have two nodes doing the fetching and one is 
stuck with a slow server, the other one is idling and wasting time.  The node 
stuck with the slow server is also underutilized, as it's slowly fetching from 
only 1 server instead of many.
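
To put rough numbers on the effect, here is a toy model.  It is not Nutch 
code, and the host counts, page counts, and per-page times are all invented 
for illustration.  It assumes that politeness means a host is fetched by one 
thread at a time, so a task can finish no sooner than its slowest host, and 
it only computes that lower bound rather than simulating a real fetch:

```python
# Toy model only: this is NOT Nutch code, and all numbers are invented.

def task_hours(hosts, threads):
    """Lower bound on the hours one fetch task needs.

    hosts   -- list of (pages, seconds_per_page), one tuple per host
    threads -- fetcher threads available to the task

    Assumes politeness means a host is fetched by one thread at a time, so
    the task cannot finish before its slowest host does, nor before the
    total work divided evenly across all threads.
    """
    per_host_secs = [pages * secs for pages, secs in hosts]
    bound_secs = max(max(per_host_secs), sum(per_host_secs) / threads)
    return bound_secs / 3600.0


THREADS = 10
fast_hosts = [(720, 1.0)] * 100   # 100 hosts, 720 pages each, 1 s per page
slow_host = (1200, 30.0)          # one host, 1200 pages, 30 s per page

task_a = task_hours(fast_hosts, THREADS)                # fast hosts only
task_b = task_hours(fast_hosts + [slow_host], THREADS)  # same plus slow host

job_hours = max(task_a, task_b)   # the job ends when the slowest task ends
idle_hours = job_hours - task_a   # time the other node sits idle

print(task_a, task_b, job_hours, idle_hours)
```

With these made-up inputs the first task needs 2 hours and the second needs 
10, so the job takes 10 hours and the first node idles for 8 of them, even 
though the slow host accounts for only a third of the second task's work.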

I imagine anyone using Nutch is seeing the same.  If not, what's the trick?

I have not tried overlapping fetching jobs yet, but I have a feeling that won't 
help a ton, plus it could lead to two fetchers fetching from the same server 
and being impolite - am I wrong?

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
