Hi Andrzej,
I've been toying with the following idea, which is an extension of
the existing URLFilter mechanism and the concept of a "crawl
frontier".
Let's suppose we have several initial seed urls, each with a
different subjective quality. We would like to crawl these, and
expand the "crawling frontier" using outlinks. However, we don't
want to do it uniformly for every initial url, but rather propagate a
certain "crawling policy" through the expanding trees of linked
pages. This "crawling policy" could consist of url filters, scoring
methods, etc - basically anything configurable in Nutch could be
included in this "policy". Perhaps it could even be the new version
of non-static NutchConf ;-)
Then, if a given initial url is a known high-quality source, we
would like to apply a "favor" policy, where we e.g. add pages linked
from that url, and in doing so we give them a higher score.
Recursively, we could apply the same policy to the next-generation
pages, or perhaps only to pages belonging to the same domain. So,
in a sense, the original notion of high quality would cascade down to
other linked pages. The important aspect to note is that all
newly discovered pages would be subject to the same policy - unless
we have compelling reasons to switch the policy (from "favor" to
"default" or to "distrust"), which at that point would essentially
change the shape of the expanding frontier.
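To make this a bit more concrete, here's a rough sketch of what such a
policy could look like. None of these classes exist in Nutch - all the
names are made up - it's only meant to show the idea of a policy
travelling along with the outlinks:

// Hypothetical only - nothing like this exists in Nutch today.
public interface CrawlPolicy {
  /** Should this outlink be added to the web db at all? */
  boolean accept(String url);

  /** The score a newly discovered outlink inherits from its parent. */
  float adjustScore(float parentScore);

  /** The policy the outlink carries forward. */
  CrawlPolicy nextPolicy(String parentUrl, String outlinkUrl);
}

// A "favor" policy: boost the inherited score, keep favoring pages on the
// same host, and fall back to the default policy once we leave that host.
class FavorPolicy implements CrawlPolicy {
  private final CrawlPolicy defaultPolicy;

  FavorPolicy(CrawlPolicy defaultPolicy) { this.defaultPolicy = defaultPolicy; }

  public boolean accept(String url) { return true; }

  public float adjustScore(float parentScore) { return parentScore * 2.0f; }

  public CrawlPolicy nextPolicy(String parentUrl, String outlinkUrl) {
    return sameHost(parentUrl, outlinkUrl) ? this : defaultPolicy;
  }

  private static boolean sameHost(String a, String b) {
    try {
      return new java.net.URL(a).getHost().equalsIgnoreCase(
             new java.net.URL(b).getHost());
    } catch (java.net.MalformedURLException e) {
      return false;
    }
  }
}

The generate/updatedb step would then ask the parent page's policy what
to do with each outlink, instead of consulting one global set of
URLFilters.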
If a given initial url is a known spammer, we would like to apply a
"distrust" policy when adding pages linked from that url (e.g. not
adding them at all, or adding them with a lowered score or a
different score calculation). And recursively we could apply a
similar "distrust" policy to any pages discovered this way. We
could also change the policy along the way, if there are compelling
reasons to do so. This means that we could follow some high-quality
links from low-quality pages, without drilling down into sites that
are known to be of low quality.
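Continuing the same (purely hypothetical) sketch, a "distrust" policy
could keep following links at a heavily discounted score, and hand off
to the default policy when an outlink points at a host on a
hand-maintained list of known-good sites - the 0.1 discount and the
whitelist below are arbitrary examples:

// Hypothetical, as above.
class DistrustPolicy implements CrawlPolicy {
  private final CrawlPolicy defaultPolicy;
  private final java.util.Set goodHosts;   // hand-maintained whitelist of hosts

  DistrustPolicy(CrawlPolicy defaultPolicy, java.util.Set goodHosts) {
    this.defaultPolicy = defaultPolicy;
    this.goodHosts = goodHosts;
  }

  public boolean accept(String url) { return true; }   // or refuse to add at all

  public float adjustScore(float parentScore) {
    return parentScore * 0.1f;                          // heavily discounted
  }

  public CrawlPolicy nextPolicy(String parentUrl, String outlinkUrl) {
    // the escape hatch: a high-quality link found on a low-quality page
    return goodHosts.contains(hostOf(outlinkUrl)) ? defaultPolicy : this;
  }

  private static String hostOf(String url) {
    try { return new java.net.URL(url).getHost().toLowerCase(); }
    catch (java.net.MalformedURLException e) { return ""; }
  }
}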
Special care needs to be taken if the same page is discovered from
pages with different policies; I haven't thought about this aspect
yet... ;-)
This sounds like the TrustRank algorithm. See
http://www.vldb.org/conf/2004/RS15P3.PDF. This talks about trust
attenuation via trust dampening (reducing the trust level as you get
further from a trusted page) and trust splitting (OPIC-like approach).
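A toy illustration of the two attenuation schemes (the method names and
the example numbers are mine, not the paper's notation):

// Trust dampening vs. trust splitting, as described in the TrustRank paper.
public class TrustAttenuation {

  /** Dampening: each hop away from a seed multiplies trust by beta < 1,
      e.g. with beta = 0.85 a page one hop from a seed gets trust 0.85. */
  static float dampen(float parentTrust, float beta) {
    return beta * parentTrust;
  }

  /** Splitting: a page's trust is shared evenly among its outlinks, much
      like OPIC splits a page's cash - trust 1.0 split across 5 outlinks
      gives 0.2 to each. */
  static float split(float parentTrust, int numOutlinks) {
    return parentTrust / numOutlinks;
  }
}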
What would be the benefits of such an approach?
* the initial page + policy would together control the expanding
crawling frontier, and the policy could be defined differently for
different starting pages. I.e. in a single web database we could keep
different "collections" or "areas of interest", each with its own
policy, while still reaping the benefits of a single web db, namely
the link information.
* URLFilters could be grouped into several policies, and it would be
easy to switch between them, or edit them.
* if the crawl process realizes it ended up on a spam page, it can
switch the page policy to "distrust" (or the other way around) and
stop crawling unwanted content. From then on, the pages linked from
that page will follow the new policy. In other words, if a crawling
frontier reaches pages with known quality problems, it would be easy
to change the policy on-the-fly to avoid them or pages linked from
them, without resorting to modifications of URLFilters.
Some of the above you can do even now with URLFilters, but any
change you make now has global consequences. You may also end up with
awfully complicated rules if you try to cover all cases in one rule
set.
The approach we took (with Nutch 0.7) is to use the Page nextScore
field as a 'crawl priority' field. We apply a scoring function to
each page that takes into account the contents of the page, the
page's domain (we have a hand-selected set of known "good" domains
and sub-domains), and the page's OPIC score. This gets divided up
among the valid outlinks (after passing these through the URL
filter), and summed into the appropriate Page.nextScore entries in
the web db.
Then at crawl time we sort by nextScore, and pick a percentage of the
total unfetched links.
This gives us a pretty good depth-first crawl, where "depth" is
defined by the page content scoring function and the set of trusted
domains.
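Very roughly, the per-page calculation looks like the sketch below. The
helper names (contentScore, domainBoost, etc.) are simplified
stand-ins, not actual Nutch or Krugle code, and how the three inputs
get combined is glossed over here (shown as a simple product):

import java.util.HashMap;
import java.util.Map;

public class PrioritySketch {

  /** Split a page's crawl priority among its (already URL-filtered)
      outlinks. The returned url -> share entries get summed into
      Page.nextScore at web db update time. */
  public static Map splitPriority(float contentScore, float domainBoost,
                                  float opicScore, String[] validOutlinks) {
    float priority = contentScore * domainBoost * opicScore;
    Map contributions = new HashMap();
    if (validOutlinks.length == 0) {
      return contributions;
    }
    float share = priority / validOutlinks.length;
    for (int i = 0; i < validOutlinks.length; i++) {
      contributions.put(validOutlinks[i], new Float(share));
    }
    return contributions;
  }
}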
The four issues we've run into with this approach are:
1. Even with a pretty broad area of interest, you wind up focusing on
a subset of all domains, which then means that the max-threads-per-host
limit (for polite crawling) starts killing your efficiency.
To work around this, we've modified Nutch to order the fetch list
URLs by domain, and constrain the max # of URLs per domain based on
the total number of URLs to be fetched, and the number of threads
we're using.
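In outline, the per-domain cap looks something like this. It's a
simplified sketch of our mod, not stock Nutch: the ordering-by-domain
part is omitted, and the "fair share" heuristic shown is just
illustrative:

import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class DomainCapSketch {

  /** Walk the candidate URLs in fetch-list order and drop any URL whose
      host has already hit its cap. The cap itself is a rough guess based
      on the fetch list size and thread count. */
  public static List capPerDomain(List urls, int totalToFetch, int numThreads)
      throws MalformedURLException {
    int maxPerHost = Math.max(1, totalToFetch / (numThreads * 10)); // heuristic
    Map counts = new HashMap();      // host -> number of URLs kept so far
    List kept = new ArrayList();
    for (Iterator it = urls.iterator(); it.hasNext();) {
      String url = (String) it.next();
      String host = new URL(url).getHost();
      Integer seen = (Integer) counts.get(host);
      int n = (seen == null) ? 0 : seen.intValue();
      if (n < maxPerHost) {
        kept.add(url);
        counts.put(host, new Integer(n + 1));
      }
    }
    return kept;
  }
}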
The next step is to turn on HTTP keep-alive, which would be pretty
easy were it not for some funky issues we've run into with the http
connection pool, and the fact that the current protocol plug-in API
doesn't give us a good channel to pass back the info that the fetch
thread needs to control the keep-alive process.
2. With a vertical crawl, you seem to wind up at "Big Bob's Server"
sooner/more often than with a breadth-first crawl. Going wide means
you spend a lot more time on sites with lots of pages, and these are
typically higher-performance/better-behaved. With a vertical crawl, we seem
to hit a lot more slow/nonconforming servers, which then kill our
fetch performance.
Typical issues are things like servers sending back an endless stream
of HTTP response headers, or trickling data back for a big file.
To work around this, we've implemented support for monitoring thread
performance and interrupting threads that are taking too long.
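A stripped-down version of the watchdog idea - this is just the shape
of it, not our actual code, and real code would need proper
synchronization around the shared start times:

// Watch a set of fetcher threads and interrupt any thread that has been
// stuck on a single URL for longer than maxMillis.
public class WatchdogSketch extends Thread {
  private final Thread[] fetchers;
  private final long[] startTimes;   // each fetcher records when it began its current URL
  private final long maxMillis;

  public WatchdogSketch(Thread[] fetchers, long[] startTimes, long maxMillis) {
    this.fetchers = fetchers;
    this.startTimes = startTimes;
    this.maxMillis = maxMillis;
  }

  public void run() {
    while (true) {
      long now = System.currentTimeMillis();
      for (int i = 0; i < fetchers.length; i++) {
        if (fetchers[i].isAlive() && now - startTimes[i] > maxMillis) {
          fetchers[i].interrupt();   // give up on the slow/nonconforming server
        }
      }
      try {
        Thread.sleep(1000);          // poll once a second
      } catch (InterruptedException e) {
        return;
      }
    }
  }
}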
3. With a vertical crawl, you typically want to keep the number of
fetched URLs (per loop) at a pretty low percentage relative to the
total number of unfetched URLs in the WebDB. This helps maintain
good crawl focus.
The problem we've run into is that the time spent updating the
WebDB, once you're past about 10M pages, starts to dominate the total
crawl time.
For example, we can fetch 200K pages in an hour on one machine, but
then it takes us 2.5 hours to update a 15M page WebDB with the
results. We can do a fetch in parallel with the update from the
previous crawl, but that doesn't buy us much. The typical solution is
to increase the number of URLs fetched, to balance out the fetch vs.
update time, but this winds up wasting bandwidth when most of the
extra URLs you fetch aren't all that interesting.
We're hoping that switching to Nutch 0.8 will help solve this issue.
We're in the middle of integrating our mods from 0.7 - don't know how
hard this will be.
4. We'd like to save multiple scores for a page, for the case where
we're applying multiple scoring functions to a page. But modifying
the WebDB to support a set of scores per page, while not hurting
performance, seems tricky.
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-470-9200