We do it the old-fashioned way :). The deep crawl is a separate crawldb
with a manually injected list of urls. The shallow crawl is a regular
full web crawl. They can have overlapping urls, cnn.com for example.
Shallow will only fetch 50 pages per host; deep is unlimited up to the
number of urls for a given shard. These two are then merged together at
the crawldb level.
And yes, we define the number of pages per shard, even in the deep
crawls, through the topN parameter on the generator for fetchlists. It
is approximate, and because the automated python jobstream grabs the
*best* urls first for each fetch, there is the problem of url
degradation.
What I mean by this is that later fetches, even though they start from
the same initial fetchlist size, will tend to have fewer urls which are
good and actually fetched. So let's say we have 40 shards, each with a
2M page generate list. The first ones might fetch 1.95M good pages. The
40th one might only fetch 1M good pages. As best we can tell, this is
simply bad urls. As scores get lower over continued crawls, you tend to
get more urls that are simply not fetchable. But since the number of
urls per shard is set in the generator, we haven't found a way around this.
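To make that concrete, here is a stripped-down sketch of one deep-crawl
shard plus the final merge. The paths, shard size and script structure
below are illustrative only, not our actual jobstream; only the
bin/nutch commands and the topN idea are the real mechanism.

#!/usr/bin/env python
# Illustrative sketch: one deep-crawl shard, then the crawldb merge.
import os, subprocess

NUTCH = "bin/nutch"
TOPN = "2000000"   # ~2M urls per shard, as in the example above

def run(*args):
    subprocess.check_call([NUTCH] + list(args))

# 1. Deep crawldb: manually injected seed list, no per-host cap.
run("inject", "crawl-deep/crawldb", "seeds/deep-urls")

# 2. One fetch cycle (shard) against the deep crawldb; -topN bounds the
#    fetchlist, which is why the shard size is only approximate.
run("generate", "crawl-deep/crawldb", "crawl-deep/segments", "-topN", TOPN)
segment = os.path.join("crawl-deep/segments",
                       sorted(os.listdir("crawl-deep/segments"))[-1])
run("fetch", segment)
run("updatedb", "crawl-deep/crawldb", segment)

# 3. The shallow crawl runs the same cycle against its own crawldb, but
#    with generate.max.per.host=50 in its config so no host contributes
#    more than 50 pages.

# 4. Finally the two dbs are merged together at the crawldb level.
run("mergedb", "crawl-merged/crawldb",
    "crawl-deep/crawldb", "crawl-shallow/crawldb")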
Dennis
[EMAIL PROTECTED] wrote:
Hi Dennis,
Ah, interesting, this is one of the things that was in the back of my mind, too - finding
a way to "even out" the fetchlists, so that, if I can't figure out which
servers are slow, I can at least get an approximately equal number of pages from each site
in the fetchlist. It looks like you have two groups of sites - sites with a pile of
pages that you want to crawl fully (deep, the head), and sites from which you are willing
to fetch only a small number of pages. This way you end up with 2 types of fetchlists,
each with a roughly equal number of pages from each site. Did I get that right?
Question: how do you generate these two different types of fetchlists? Same
"generate" run, but with different urlfilter (prefix- or regex-urlfilter) configs?
Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
----- Original Message ----
From: Dennis Kubes <[EMAIL PROTECTED]>
To: [email protected]
Sent: Monday, April 21, 2008 7:43:43 PM
Subject: Re: Fetching inefficiency
This may not be applicable to what you are doing, but for a whole web
crawl we tend to separate deep crawl sites and shallow crawl sites. The
shallow crawl, which is most of the web, gets a max of 50 pages per
host, set via the generate.max.per.host config variable. A deep crawl
would contain only a list of deep crawl sites, say wikipedia or cnn,
would be limited to those sites by url filters, and would be allowed
unlimited urls. A deep crawl would run through a number of fetch
cycles, say a depth of 3-5.
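For reference, that cap is just a property in the shallow crawl's
nutch-site.xml, roughly like this (the 50 comes from the example above;
the deep crawl config leaves it at the unlimited default of -1):

<property>
  <name>generate.max.per.host</name>
  <value>50</value>
  <description>Maximum number of urls per host in a single
  fetchlist; -1 means unlimited.</description>
</property>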
Dennis
[EMAIL PROTECTED] wrote:
Hello,
I am wondering how others deal with the following, which I see as fetching
inefficiency:
When fetching, the fetchlist is broken up into multiple parts and fetchers on
cluster nodes start fetching. Some fetchers end up fetching from fast servers,
and some from very very slow servers. Those fetching from slow servers take a
long time to complete and prolong the whole fetching process. For instance,
I've seen tasks from the same fetch job finish in only 1-2 hours, and others in
10 hours. Those taking 10 hours were stuck fetching pages from a single site
or a handful of slow sites. If you have two nodes doing the fetching and one is
stuck with a slow server, the other one is idling and wasting time. The node
stuck with the slow server is also underutilized, as it's slowly fetching from
only 1 server instead of many.
I imagine anyone using Nutch is seeing the same. If not, what's the trick?
I have not tried overlapping fetching jobs yet, but I have a feeling that
won't help a ton, plus it could lead to two fetchers fetching from the same
server and being impolite - am I wrong?
Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch