We do it the old-fashioned way :). The deep crawl is a separate crawldb
with a manually injected list of urls. The shallow crawl is a regular
full web crawl. They can have overlapping urls, cnn.com for example.
Shallow will only fetch 50 pages per host; deep is unlimited up to the
number of urls for a given shard. These two are then merged together at
the crawldb level.
And yes, we define the number of pages per shard, even in the deep
crawls, through the topN parameter on the generator for fetchlists. It
is approximate, and because the automated python jobstream grabs the
*best* urls first for each fetch, there is the problem of url
degradation.
What I mean by this is that later fetches, even though they start from
the same initial fetchlist size, will tend to have fewer urls which are
good and actually fetched. So let's say we have 40 shards, each with a
2M page generate list. The first ones might fetch 1.95M good pages. The
40th one might only fetch 1M good pages. As best we can tell, this is
simply bad urls. As scores get lower over continued crawls, you tend to
get more urls that are simply not fetchable. But since the number of
urls per shard is set in the generator, we haven't found a way around this.
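To make that concrete, here is a stripped-down sketch of one deep-crawl
shard plus the final merge. The paths, shard size and script structure
below are illustrative only, not our actual jobstream; only the
bin/nutch commands and the topN idea are the real mechanism.

#!/usr/bin/env python
# Illustrative sketch: one deep-crawl shard, then the crawldb merge.
import os, subprocess

NUTCH = "bin/nutch"
TOPN = "2000000"   # ~2M urls per shard, as in the example above

def run(*args):
    subprocess.check_call([NUTCH] + list(args))

# 1. Deep crawldb: manually injected seed list, no per-host cap.
run("inject", "crawl-deep/crawldb", "seeds/deep-urls")

# 2. One fetch cycle (shard) against the deep crawldb; -topN bounds the
#    fetchlist, which is why the shard size is only approximate.
run("generate", "crawl-deep/crawldb", "crawl-deep/segments", "-topN", TOPN)
segment = os.path.join("crawl-deep/segments",
                       sorted(os.listdir("crawl-deep/segments"))[-1])
run("fetch", segment)
run("updatedb", "crawl-deep/crawldb", segment)

# 3. The shallow crawl runs the same cycle against its own crawldb, but
#    with generate.max.per.host=50 in its config so no host contributes
#    more than 50 pages.

# 4. Finally the two dbs are merged together at the crawldb level.
run("mergedb", "crawl-merged/crawldb",
    "crawl-deep/crawldb", "crawl-shallow/crawldb")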
Dennis
[EMAIL PROTECTED] wrote:
Hi Dennis,
Ah, interesting, this is one of the things that was in the back of my mind, too - finding
a way to "even out" the fetchlists, so that, if I can't figure out which
servers are slow, I can at least get an approximately equal number of pages from each site
in the fetchlist. It looks like you have two groups of sites - sites with a pile of
pages that you want to crawl fully (deep, the head), and sites from which you are willing
to fetch only a small number of pages. This way you end up with 2 types of fetchlists,
each with a roughly equal number of pages from each site. Did I get that right?
Question: how do you generate these two different types of fetchlists? Same
"generate" run, but with different urlfilter (prefix- or regex-urlfilter) configs?
Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
----- Original Message ----
From: Dennis Kubes <[EMAIL PROTECTED]>
To: [email protected]
Sent: Monday, April 21, 2008 7:43:43 PM
Subject: Re: Fetching inefficiency
This may not be applicable to what you are doing, but for a whole web
crawl we tend to separate deep crawl sites and shallow crawl sites. The
shallow crawl, which is most of the web, gets a max of 50 pages per
host, set via the generate.max.per.host config variable. A deep crawl
would contain only a list of deep crawl sites, say wikipedia or cnn,
would be limited to those sites by url filters, and would be allowed
unlimited urls. A deep crawl would run through a number of fetch
cycles, say a depth of 3-5.
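For reference, that cap is just a property in the shallow crawl's
nutch-site.xml, roughly like this (the 50 comes from the example above;
the deep crawl config leaves it at the unlimited default of -1):

<property>
  <name>generate.max.per.host</name>
  <value>50</value>
  <description>Maximum number of urls per host in a single
  fetchlist; -1 means unlimited.</description>
</property>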
Dennis
[EMAIL PROTECTED] wrote:
Hello,
I am wondering how others deal with the following, which I see as fetching
inefficiency:
When fetching, the fetchlist is broken up into multiple parts and fetchers on
cluster nodes start fetching. Some fetchers end up fetching from fast servers,
and some from very very slow servers. Those fetching from slow servers take a
long time to complete and prolong the whole fetching process. For instance,
I've seen tasks from the same fetch job finish in only 1-2 hours, and others in
10 hours. Those taking 10 hours were stuck fetching pages from a single site
or a handful of slow sites. If you have two nodes doing the fetching and one is
stuck with a slow server, the other one is idling and wasting time. The node
stuck with the slow server is also underutilized, as it's slowly fetching from
only 1 server instead of many.
I imagine anyone using Nutch is seeing the same. If not, what's the trick?
I have not tried overlapping fetching jobs yet, but I have a feeling that
won't help a ton, plus it could lead to two fetchers fetching from the same
server and being impolite - am I wrong?
Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch