We do it the old-fashioned way :). The deep crawl is a separate crawldb seeded with a manually injected list of urls; the shallow crawl is a regular full web crawl. They can have overlapping urls, cnn.com for example. The shallow crawl will only fetch 50 pages per host, while the deep crawl is unlimited up to the number of urls for a given shard. The two are then merged together at the crawldb level.
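In jobstream terms it looks roughly like the sketch below. This is not our actual scripts, just the shape of it: the NUTCH path, seed directories, and crawldb paths are placeholders, and it assumes the stock bin/nutch inject and mergedb commands.

    # Sketch only: two crawldbs, merged at the end. Paths are placeholders.
    import subprocess

    NUTCH = "/opt/nutch/bin/nutch"   # hypothetical install location

    def nutch(*args):
        # run a Nutch command, fail loudly on error
        subprocess.run([NUTCH, *args], check=True)

    # deep crawldb: seeded only with the manually curated deep-crawl urls
    nutch("inject", "crawl/deep_crawldb", "seeds/deep_urls")

    # shallow crawldb: seeded with the regular whole-web url list
    nutch("inject", "crawl/shallow_crawldb", "seeds/web_urls")

    # ... generate/fetch/updatedb cycles run against each crawldb ...

    # merge the two back into one crawldb; overlapping urls (cnn.com, etc.)
    # are resolved by the merger
    nutch("mergedb", "crawl/merged_crawldb",
          "crawl/deep_crawldb", "crawl/shallow_crawldb")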

And yes, we define the number of pages per shard, even in the deep crawls, through the topN parameter on the generator for fetchlists. It is approximate, and because the automated python jobstream we use grabs the *best* urls first for each fetch, there is the problem of url degradation.

What I mean by this is that later fetches, even though they start from the same initial fetchlist size, tend to end up with fewer urls that are good and actually fetched. So let's say we have 40 shards, each with a 2M-page generate list. The first ones might fetch 1.95M good pages; the 40th might only fetch 1M. As best we can tell this is simply bad urls: as scores get lower in continued crawls, you tend to get more urls that are simply not fetchable. But since the number of urls per shard is set at generate time, we haven't found a way around this.
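Per shard, the generate step is just the stock -topN cap, along these lines (a sketch for illustration only; the paths and the 2M figure are placeholders):

    # Sketch: one generate pass per shard, capped at roughly the best TOP_N urls by score.
    import subprocess

    NUTCH = "/opt/nutch/bin/nutch"   # hypothetical install location
    TOP_N = 2000000                  # ~2M urls per generated fetchlist

    def nutch(*args):
        subprocess.run([NUTCH, *args], check=True)

    # each pass writes a new segment under crawl/segments holding the top-scored urls;
    # later shards draw from lower-scored (and increasingly unfetchable) urls
    nutch("generate", "crawl/crawldb", "crawl/segments", "-topN", str(TOP_N))
    # ... fetch the segment, updatedb, then generate the next shard ...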

Dennis

[EMAIL PROTECTED] wrote:
Hi Dennis,

Ah, interesting, this is one of the things that was in the back of my mind, too - finding a way to "even out" the fetchlists, so that, if I can't figure out which servers are slow, I can at least get an approximately equal number of pages from each site in the fetchlist.  It looks like you have two groups of sites - sites with a pile of pages that you want to crawl fully (deep, the head), and sites from which you are willing to fetch only a small number of pages.  That way you end up with two types of fetchlists, each with a roughly equal number of pages from each site.  Did I get that right?

Question: how do you generate these two different types of fetchlists?  Same "generate" run, but with different urlfilter (prefix- or regex-urlfilter) configs?

Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----
From: Dennis Kubes <[EMAIL PROTECTED]>
To: [email protected]
Sent: Monday, April 21, 2008 7:43:43 PM
Subject: Re: Fetching inefficiency

This may not be applicable to what you are doing, but for a whole web crawl we tend to separate deep crawl sites and shallow crawl sites. The shallow crawl, which covers most of the web, gets a max of 50 pages per host, set via the generate.max.per.host config variable. A deep crawl contains only a list of deep crawl sites, say wikipedia or cnn; it is limited by url filters but allowed unlimited urls. A deep crawl runs through a number of fetch cycles, say a depth of 3-5.
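A rough sketch of the deep side is below (placeholder paths, not our actual jobstream; the shallow 50-page cap just comes from setting generate.max.per.host to 50 in nutch-site.xml, which isn't shown here):

    # Sketch: a deep-crawl run at depth 3, against its own crawldb.
    # Assumes the fetcher is also parsing; otherwise run 'nutch parse' before updatedb.
    import os, subprocess

    NUTCH = "/opt/nutch/bin/nutch"   # hypothetical install location

    def nutch(*args):
        subprocess.run([NUTCH, *args], check=True)

    for _ in range(3):               # depth of 3-5 fetch cycles
        nutch("generate", "deep/crawldb", "deep/segments")
        # the newest directory under deep/segments is the segment just generated
        segment = os.path.join("deep/segments", sorted(os.listdir("deep/segments"))[-1])
        nutch("fetch", segment)
        nutch("updatedb", "deep/crawldb", segment)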

Dennis

[EMAIL PROTECTED] wrote:
Hello,

I am wondering how others deal with the following, which I see as fetching
inefficiency:

When fetching, the fetchlist is broken up into multiple parts and fetchers on cluster nodes start fetching. Some fetchers end up fetching from fast servers, and some from very slow servers. Those fetching from slow servers take a long time to complete and prolong the whole fetching process. For instance, I've seen tasks from the same fetch job finish in only 1-2 hours, and others in 10 hours. Those taking 10 hours were stuck fetching pages from a single slow site or a handful of them. If you have two nodes doing the fetching and one is stuck with a slow server, the other one is idling and wasting time. The node stuck with the slow server is also underutilized, as it's slowly fetching from only one server instead of many.
I imagine anyone using Nutch is seeing the same.  If not, what's the trick?

I have not tried overlapping fetching jobs yet, but I have a feeling that
won't help a ton, plus it could lead to two fetchers fetching from the same server and being impolite - am I wrong?
Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

