Hi Dennis,

Ah, interesting, this is one of the things that was in the back of my mind, too 
- finding a way to "even out" the fetchlists, so that, if I can't figure out 
which servers are slow, I can at least get an approximately equal number of 
pages from each site in the fetchlist.  It looks like you have two groups of 
sites - sites with a pile of pages that you want to crawl fully (deep, the 
head), and sites from which you are willing to fetch only a small number of 
pages.  This way you end up with two types of fetchlists, each with a roughly 
equal number of pages from each site.  Did I get that right?
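
For the shallow list, I'm assuming the per-host cap comes from something like 
this in conf/nutch-site.xml (the property name is the one you mention below; 
the value of 50 is just your example):

  <property>
    <name>generate.max.per.host</name>
    <value>50</value>
    <description>Maximum number of urls per host in a single fetchlist
    (-1 for unlimited).</description>
  </property>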

Question: how do you generate these two different types of fetchlists?  Same 
"generate" run, but with different urlfilter (prefix- or regex-urlfilter) 
configs?
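
Just so I'm picturing it right, here's a rough sketch of what I have in mind 
(the filter rules, conf dir names, and paths are made up, and I'm assuming 
bin/nutch honors NUTCH_CONF_DIR so each run can pick up its own 
regex-urlfilter.txt):

  # conf-deep/regex-urlfilter.txt -- accept only the head sites
  +^http://([a-z0-9]*\.)*wikipedia\.org/
  +^http://([a-z0-9]*\.)*cnn\.com/
  -.

  # conf-shallow/regex-urlfilter.txt -- everything but the head sites,
  # with generate.max.per.host capping each host
  -^http://([a-z0-9]*\.)*wikipedia\.org/
  -^http://([a-z0-9]*\.)*cnn\.com/
  +.

  # two separate generate runs, one per fetchlist type
  NUTCH_CONF_DIR=conf-deep    bin/nutch generate crawl/crawldb crawl/segments
  NUTCH_CONF_DIR=conf-shallow bin/nutch generate crawl/crawldb crawl/segments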

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----
> From: Dennis Kubes <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Monday, April 21, 2008 7:43:43 PM
> Subject: Re: Fetching inefficiency
> 
> This may not be applicable to what you are doing, but for a whole-web 
> crawl we tend to separate deep crawl sites and shallow crawl sites. 
> Shallow crawl sites, which make up most of the web, get a max of 50 
> pages, set via the generate.max.per.host config variable.  A deep crawl 
> would contain only a list of deep crawl sites, say wikipedia or cnn, 
> would be limited by url filters, and would be allowed an unlimited 
> number of urls.  A deep crawl would run through a number of fetch 
> cycles, say a depth of 3-5.
> 
> Dennis
> 
> [EMAIL PROTECTED] wrote:
> > Hello,
> > 
> > I am wondering how others deal with the following, which I see as fetching 
> > inefficiency:
> > 
> > When fetching, the fetchlist is broken up into multiple parts and fetchers 
> > on cluster nodes start fetching.  Some fetchers end up fetching from fast 
> > servers, and some from very very slow servers.  Those fetching from slow 
> > servers take a long time to complete and prolong the whole fetching 
> > process.  For instance, I've seen tasks from the same fetch job finish in 
> > only 1-2 hours, and others in 10 hours.  Those taking 10 hours were stuck 
> > fetching pages from a single or handful of slow sites.  If you have two 
> > nodes doing the fetching and one is stuck with a slow server, the other 
> > one is idling and wasting time.  The node stuck with the slow server is 
> > also underutilized, as it's slowly fetching from only 1 server instead of 
> > many.
> > 
> > I imagine anyone using Nutch is seeing the same.  If not, what's the trick?
> > 
> > I have not tried overlapping fetching jobs yet, but I have a feeling that 
> > won't help a ton, plus it could lead to two fetchers fetching from the same 
> > server and being impolite - am I wrong?
> > 
> > Thanks,
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> > 
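
P.S. On "a depth of 3-5" for the deep crawl: I read that as just repeating the 
usual generate/fetch/updatedb cycle 3-5 times against the deep-only config, 
roughly like this (paths and the -topN value are made up):

  # one cycle of the deep crawl; repeat 3-5 times
  NUTCH_CONF_DIR=conf-deep bin/nutch generate crawl/crawldb crawl/segments -topN 100000
  s=`ls -d crawl/segments/* | tail -1`
  bin/nutch fetch $s
  bin/nutch updatedb crawl/crawldb $s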
