Hi Andrzej,

I've been toying with the following idea, which is an extension of the existing URLFilter mechanism and the concept of a "crawl frontier".

Let's suppose we have several initial seed URLs, each with a different subjective quality. We would like to crawl these, and expand the "crawling frontier" using outlinks. However, we don't want to do this uniformly for every initial URL, but rather propagate a certain "crawling policy" through the expanding trees of linked pages. This "crawling policy" could consist of URL filters, scoring methods, etc. - basically anything configurable in Nutch could be included in this "policy". Perhaps it could even be the new version of non-static NutchConf ;-)

Then, if a given initial URL is a known high-quality source, we would like to apply a "favor" policy, where we e.g. add pages linked from that URL and, in doing so, give them a higher score. Recursively, we could apply the same policy to the next-generation pages, or perhaps only to pages belonging to the same domain. So, in a sense, the original notion of high quality would cascade down to other linked pages. The important aspect to note is that all newly discovered pages would be subject to the same policy - unless we have compelling reasons to switch the policy (from "favor" to "default" or to "distrust"), which at that point would essentially change the shape of the expanding frontier.

If a given initial URL is a known spammer, we would like to apply a "distrust" policy when adding pages linked from that URL (e.g. not adding them at all, or adding them with a lowered score, or applying a different score calculation). And recursively we could apply a similar "distrust" policy to any pages discovered this way. We could also change the policy along the way, if there are compelling reasons to do so. This means that we could follow some high-quality links from low-quality pages, without drilling down into sites which are known to be of low quality.
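
To make the idea a bit more concrete, here is a very rough Java sketch of what such a policy might look like; all of the names (CrawlPolicy, accept, adjustScore, next) are made up for illustration and don't exist in Nutch:

  // Hypothetical sketch only - none of these types exist in Nutch.
  // A policy travels with a page and decides how that page's outlinks
  // are treated and which policy they inherit.
  public interface CrawlPolicy {

    /** Decide whether an outlink should be added to the web db at all. */
    boolean accept(String fromUrl, String outlinkUrl);

    /** Adjust the score handed down to an accepted outlink. */
    float adjustScore(float inheritedScore);

    /**
     * Return the policy the outlink should carry from now on - usually
     * this same policy, but e.g. a "favor" policy could hand off to
     * "default" when the outlink leaves the original domain, and any
     * policy could switch to "distrust" when a known spam host shows up.
     */
    CrawlPolicy next(String fromUrl, String outlinkUrl);
  }

The key point is the next() method: the policy itself decides what policy each outlink inherits, which is how "favor" or "distrust" would cascade down the frontier.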

Special care needs to be taken when the same page is discovered from pages with different policies - I haven't thought about this aspect yet... ;-)

This sounds like the TrustRank algorithm. See http://www.vldb.org/conf/2004/RS15P3.PDF. This talks about trust attenuation via trust dampening (reducing the trust level as you get further from a trusted page) and trust splitting (OPIC-like approach).
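
For reference, the two attenuation schemes boil down to something like this (a minimal sketch; the method names are mine, not from the paper or from Nutch):

  // Minimal illustration of the two attenuation schemes from the TrustRank
  // paper; beta is the dampening factor, e.g. 0.85.
  public class TrustAttenuation {

    /** Trust dampening: each hop away from a trusted seed reduces trust. */
    static float dampened(float parentTrust, float beta) {
      return parentTrust * beta;
    }

    /** Trust splitting: a page's trust is divided among its outlinks. */
    static float split(float parentTrust, int outlinkCount) {
      return outlinkCount > 0 ? parentTrust / outlinkCount : 0f;
    }
  }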

What would be the benefits of such an approach?

* the initial page plus its policy would together control the expanding crawling frontier, and this could be defined differently for different starting pages. I.e. in a single web database we could keep different "collections" or "areas of interest" with differently specified policies, while still reaping the benefits of a single web db, namely the link information.

* URLFilters could be grouped into several policies, and it would be easy to switch between them, or edit them.

* if the crawl process realizes it has ended up on a spam page, it can switch that page's policy to "distrust" (or the other way around) and stop crawling unwanted content. From then on, the pages linked from that page will follow the new policy. In other words, if the crawling frontier reaches pages with known quality problems, it would be easy to change the policy on the fly to avoid them, or pages linked from them, without resorting to modifications of URLFilters.

Some of the above you can do even now with URLFilters, but any change you make has global consequences. You may also end up with awfully complicated rules if you try to cover all cases in a single rule set.

The approach we took (with Nutch 0.7) is to use the Page nextScore field as a 'crawl priority' field. We apply a scoring function to each page that takes into account the contents of the page, the page's domain (we have a hand-selected set of known "good" domains and sub-domains), and the page's OPIC score. This score gets divided up among the valid outlinks (after passing them through the URL filters), and summed into the appropriate Page.nextScore entries in the web db.

Then at crawl time we sort by nextScore, and pick a percentage of the total unfetched links.

This gives us a pretty good depth-first crawl, where "depth" is defined by the page content scoring function and the set of trusted domains.
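
In pseudo-Java, the nextScore bookkeeping is roughly the following (a simplified sketch; the class and method names are placeholders, not our actual Nutch 0.7 code):

  import java.util.*;

  // Simplified sketch of the nextScore bookkeeping - names are placeholders.
  public class NextScoreSketch {

    /**
     * pageScore is the output of our page scoring function (page content,
     * trusted-domain boost, OPIC score); acceptedOutlinks are the outlinks
     * that already passed the URL filters.
     */
    static void distribute(float pageScore, List<String> acceptedOutlinks,
                           Map<String, Float> nextScores) {
      if (acceptedOutlinks.isEmpty()) {
        return;
      }
      float share = pageScore / acceptedOutlinks.size();  // divide evenly
      for (String url : acceptedOutlinks) {
        Float old = nextScores.get(url);                  // sum into Page.nextScore
        nextScores.put(url, (old == null ? 0f : old) + share);
      }
    }
  }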

The four issues we've run into with this approach are:

1. Even with a pretty broad area of interest, you wind up focusing on a subset of all domains, which then means that the max-threads-per-host limit (for polite crawling) starts killing your efficiency.

To work around this, we've modified Nutch to order the fetch list URLs by domain, and constrain the max # of URLs per domain based on the total number of URLs to be fetched, and the number of threads we're using.
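
Roughly, the capping step looks like the sketch below (invented names, and the cap formula in the comment is just one plausible way to derive it from the fetch list size and thread count - this is not our actual patch):

  import java.util.*;

  // Sketch of the per-domain cap on the fetch list.
  public class DomainCapSketch {

    /**
     * urlsByScore is already sorted by nextScore; maxPerDomain could be
     * derived, for example, as totalToFetch / numFetchThreads.
     */
    static List<String> cap(List<String> urlsByScore, int maxPerDomain,
                            int totalToFetch) {
      Map<String, Integer> perDomain = new HashMap<String, Integer>();
      List<String> result = new ArrayList<String>();
      for (String url : urlsByScore) {
        String domain = domainOf(url);
        int used = perDomain.containsKey(domain) ? perDomain.get(domain) : 0;
        if (used < maxPerDomain) {
          perDomain.put(domain, used + 1);
          result.add(url);
          if (result.size() >= totalToFetch) {
            break;
          }
        }
      }
      return result;
    }

    /** Hypothetical helper - real code might use the host or a pay-level domain. */
    static String domainOf(String url) {
      try {
        return new java.net.URL(url).getHost();
      } catch (java.net.MalformedURLException e) {
        return url;
      }
    }
  }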

The next step is to turn on HTTP keep-alive, which would be pretty easy, other than some funky issues we've run into with the HTTP connection pool and the fact that the current protocol plug-in API doesn't give us a good channel to pass back the info that the fetch thread needs to control the keep-alive process.

2. With a vertical crawl, you seem to wind up at "Big Bob's Server" sooner/more often than with a breadth-first crawl. Going wide means you spend a lot more time on sites with lots of pages, and these are typically higher-performance/better-behaved. With a vertical crawl, we seem to hit a lot more slow/nonconforming servers, which then kill our fetch performance.

Typical issues are things like servers sending back an endless stream of HTTP response headers, or trickling data back for a big file.

To work around this, we've implemented support for monitoring thread performance and interrupting threads that are taking too long.
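
In outline, the watchdog does something like this (a sketch only, not the actual code):

  import java.util.Map;
  import java.util.concurrent.ConcurrentHashMap;

  // Sketch of a fetch-thread watchdog - not the actual implementation.
  // Each fetch thread records when it started its current request; the
  // watchdog interrupts any thread that has been busy for too long.
  // (A thread blocked in socket I/O may also need its socket closed,
  // since interrupt() alone won't wake it.)
  public class FetchWatchdogSketch implements Runnable {

    private final Map<Thread, Long> startTimes = new ConcurrentHashMap<Thread, Long>();
    private final long maxFetchMillis;

    public FetchWatchdogSketch(long maxFetchMillis) {
      this.maxFetchMillis = maxFetchMillis;
    }

    /** Fetch threads call this just before issuing a request... */
    public void starting() {
      startTimes.put(Thread.currentThread(), System.currentTimeMillis());
    }

    /** ...and this when the request completes or fails. */
    public void finished() {
      startTimes.remove(Thread.currentThread());
    }

    public void run() {
      while (!Thread.currentThread().isInterrupted()) {
        long now = System.currentTimeMillis();
        for (Map.Entry<Thread, Long> e : startTimes.entrySet()) {
          if (now - e.getValue() > maxFetchMillis) {
            e.getKey().interrupt();  // abandon the slow or hung request
          }
        }
        try {
          Thread.sleep(1000);
        } catch (InterruptedException ex) {
          return;
        }
      }
    }
  }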

3. With a vertical crawl, you typically want to keep the number of fetched URLs (per loop) at a pretty low percentage relative to the total number of unfetched URLs in the WebDB. This helps with maintaining good crawl focus.

The problem we've run into is that the time spent updating the WebDB, once you're past about 10M pages, starts to dominate the total crawl time.

For example, we can fetch 200K pages in an hour on one machine, but then it takes us 2.5 hours to update a 15M page WebDB with the results. We can do a fetch in parallel with the update from the previous crawl, but that doesn't buy us much. The typical solution is to increase the number of URLs fetched, to balance out the fetch vs. update time, but this winds up wasting bandwidth when most of the extra URLs you fetch aren't all that interesting.

We're hoping that switching to Nutch 0.8 will help solve this issue. We're in the middle of integrating our mods from 0.7 - don't know how hard this will be.

4. We'd like to save multiple scores for a page, for the case where we're applying multiple scoring functions to a page. But modifying the WebDB to support a set of scores per page, while not hurting performance, seems tricky.
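
Purely for illustration - this is hypothetical, not anything in the WebDB today - one possible shape would be a fixed-length score vector per page, so that record sizes stay constant:

  // Hypothetical sketch of a multi-score page record - not part of the WebDB.
  // A fixed number of float slots keeps the on-disk record size constant,
  // which is one way to avoid hurting read/write performance.
  public class ScoredPage {

    public static final int NUM_SCORES = 4;  // e.g. content, domain, OPIC, spam
    private final float[] scores = new float[NUM_SCORES];

    public float getScore(int i) { return scores[i]; }
    public void setScore(int i, float value) { scores[i] = value; }

    /** Combine the per-function scores into a single crawl priority. */
    public float combined(float[] weights) {
      float sum = 0f;
      for (int i = 0; i < NUM_SCORES; i++) {
        sum += weights[i] * scores[i];
      }
      return sum;
    }
  }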

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-470-9200

