This may not be applicable to what you are doing, but for a whole-web crawl we tend to separate deep-crawl sites from shallow-crawl sites. Shallow-crawl sites, which make up most of the web, get a maximum of 50 pages each, set via the generate.max.per.host config variable. A deep crawl would contain only a list of deep-crawl sites, say Wikipedia or CNN, would be restricted to those sites by URL filters, and would be allowed an unlimited number of URLs per host. A deep crawl would run through a number of fetch cycles, say to a depth of 3-5.
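In case it helps, here is a minimal sketch of how that cap would look in nutch-site.xml (the 50 matches the shallow-crawl setup above; if I remember right, -1 removes the per-host limit, which is what a deep-crawl config would use):

  <property>
    <name>generate.max.per.host</name>
    <value>50</value>
    <description>Maximum number of URLs per host in a single fetchlist;
    -1 means no per-host limit.</description>
  </property>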

Dennis

[EMAIL PROTECTED] wrote:
Hello,

I am wondering how others deal with the following, which I see as a fetching 
inefficiency:


When fetching, the fetchlist is broken up into multiple parts and fetchers on 
cluster nodes start fetching.  Some fetchers end up fetching from fast servers, 
and some from very slow servers.  Those fetching from slow servers take a long 
time to complete and prolong the whole fetching process.  For instance, 
I've seen tasks from the same fetch job finish in only 1-2 hours, and others in 
10 hours.  Those taking 10 hours were stuck fetching pages from a single slow 
site or a handful of them.  If you have two nodes doing the fetching and one is 
stuck with a slow server, the other one finishes early and sits idle, wasting 
time.  The node stuck with the slow server is also underutilized, as it's 
slowly fetching from only one server instead of many.
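
To make the arithmetic concrete, here's a rough back-of-envelope sketch (plain 
Python, nothing Nutch-specific; the per-page costs and page counts are made up, 
and it only models the floor imposed by fetching a single host serially):

# Lower bound on a fetch task's runtime imposed by one host, assuming
# URLs from the same host are fetched one after another and each fetch
# from that host costs (response time + politeness delay) seconds.
def host_floor_hours(pages_on_host, seconds_per_page):
    return pages_on_host * seconds_per_page / 3600.0

print(host_floor_hours(50, 1.0))      # responsive host: ~0.01 h
print(host_floor_hours(3600, 10.0))   # slow host with a deep queue: 10 h

So a task whose part of the fetchlist happens to contain a few thousand URLs 
from one slow host can't finish in under roughly 10 hours, no matter how early 
every other host in that part was done.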

I imagine anyone using Nutch is seeing the same.  If not, what's the trick?

I have not tried overlapping fetching jobs yet, but I have a feeling that won't 
help a ton, plus it could lead to two fetchers fetching from the same server 
and being impolite - am I wrong?

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
