This may not be applicable to what you are doing, but for a whole-web
crawl we tend to separate deep-crawl sites from shallow-crawl sites.
Shallow-crawl sites, which make up most of the web, get a maximum of 50
pages, set via the generate.max.per.host config property. A deep crawl
would contain only a list of deep-crawl sites, say wikipedia or cnn,
would be limited by URL filters, and would be allowed unlimited URLs. A
deep crawl would run through a number of fetch cycles, say a depth of 3-5.
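For what it's worth, here is a rough sketch of how that split can look in
the config. The generate.max.per.host property and the value of 50 are the
ones described above; the regex-urlfilter.txt patterns, directory names,
and the little crawl loop are only illustrative, and the exact commands
vary a bit between Nutch versions:

    <!-- shallow crawl, nutch-site.xml: cap pages per host at generate time -->
    <property>
      <name>generate.max.per.host</name>
      <value>50</value>
    </property>

    # deep crawl, regex-urlfilter.txt: allow only the listed deep-crawl sites
    # (wikipedia.org and cnn.com are just example entries)
    +^http://([a-z0-9-]+\.)*wikipedia\.org/
    +^http://([a-z0-9-]+\.)*cnn\.com/
    # reject everything else
    -.

    # deep crawl, run 3-5 generate/fetch/updatedb cycles (paths are illustrative)
    for i in 1 2 3; do
      bin/nutch generate crawl/crawldb crawl/segments
      segment=`ls -d crawl/segments/* | tail -1`
      bin/nutch fetch $segment
      bin/nutch updatedb crawl/crawldb $segment
    done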
Dennis
[EMAIL PROTECTED] wrote:
Hello,
I am wondering how others deal with the following, which I see as a
fetching inefficiency:
When fetching, the fetchlist is broken up into multiple parts and fetchers on
cluster nodes start fetching. Some fetchers end up fetching from fast servers,
and some from very very slow servers. Those fetching from slow servers take a
long time to complete and prolong the whole fetching process. For instance,
I've seen tasks from the same fetch job finish in only 1-2 hours, and others in
10 hours. Those taking 10 hours were stuck fetching pages from a single
slow site or a handful of them. If you have two nodes doing the fetching and one is
stuck with a slow server, the other one is idling and wasting time. The node
stuck with the slow server is also underutilized, as it's slowly fetching from
only 1 server instead of many.
I imagine anyone using Nutch is seeing the same. If not, what's the trick?
I have not tried overlapping fetching jobs yet, but I have a feeling that won't
help a ton, plus it could lead to two fetchers fetching from the same server
and being impolite - am I wrong?
Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch