Hi Dennis,

Ah, interesting - this is one of the things that was in the back of my mind, too: finding a way to "even out" the fetchlists, so that, if I can't figure out which servers are slow, I can at least get an approximately equal number of pages from each site in the fetchlist. It looks like you have two groups of sites - sites with a pile of pages that you want to crawl fully (deep, the head), and sites from which you are willing to fetch only a small number of pages. This way you end up with two types of fetchlists, each with a roughly equal number of pages from each site. Did I get that right?
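To make sure I follow, here is my (possibly wrong) guess at what the two setups look like. The property name and the value of 50 are from your description below, and wikipedia/cnn are your examples; the exact filter lines are just made up for illustration.

For the shallow crawl, cap each host in nutch-site.xml:

  <property>
    <name>generate.max.per.host</name>
    <value>50</value>
    <description>Max number of urls per host in a single fetchlist.</description>
  </property>

For the deep crawl, leave that cap off (unlimited) and instead whitelist only the deep-crawl sites in regex-urlfilter.txt (or a prefix-urlfilter equivalent):

  # accept only the deep-crawl sites
  +^http://([a-z0-9-]+\.)*wikipedia\.org/
  +^http://([a-z0-9-]+\.)*cnn\.com/
  # reject everything else
  -.

...and then run the generate/fetch/updatedb cycle 3-5 times against that crawldb.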
Question: how do you generate these two different types of fetchlists? Same "generate" run, but with different urlfilter (prefix- or regex-urlfilter) configs?

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: Dennis Kubes <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Monday, April 21, 2008 7:43:43 PM
> Subject: Re: Fetching inefficiency
>
> This may not be applicable to what you are doing, but for a whole-web
> crawl we tend to separate deep-crawl sites and shallow-crawl sites.
> Shallow crawl, which is most of the web, gets a max of 50 pages, set via
> the generate.max.per.host config variable. A deep crawl would contain
> only a list of deep-crawl sites, say wikipedia or cnn, would be limited
> by url filters, and would be allowed unlimited urls. A deep crawl would
> run through a number of fetch cycles, say a depth of 3-5.
>
> Dennis
>
> [EMAIL PROTECTED] wrote:
> > Hello,
> >
> > I am wondering how others deal with the following, which I see as
> > fetching inefficiency:
> >
> > When fetching, the fetchlist is broken up into multiple parts and
> > fetchers on cluster nodes start fetching. Some fetchers end up
> > fetching from fast servers, and some from very, very slow servers.
> > Those fetching from slow servers take a long time to complete and
> > prolong the whole fetching process. For instance, I've seen tasks from
> > the same fetch job finish in only 1-2 hours, and others in 10 hours.
> > Those taking 10 hours were stuck fetching pages from a single or a
> > handful of slow sites. If you have two nodes doing the fetching and
> > one is stuck with a slow server, the other one is idling and wasting
> > time. The node stuck with the slow server is also underutilized, as
> > it's slowly fetching from only 1 server instead of many.
> >
> > I imagine anyone using Nutch is seeing the same. If not, what's the
> > trick?
> >
> > I have not tried overlapping fetching jobs yet, but I have a feeling
> > that won't help a ton, plus it could lead to two fetchers fetching
> > from the same server and being impolite - am I wrong?
> >
> > Thanks,
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
