Hi Andrzej,
I've been toying with the following idea, which is an extension of
the existing URLFilter mechanism and the concept of a "crawl
frontier".
Let's suppose we have several initial seed urls, each with a
different subjective quality. We would like to crawl these, and
expand the "crawling frontier" using outlinks. However, we don't
want to do it uniformly for every initial url, but rather propagate a
certain "crawling policy" through the expanding trees of linked
pages. This "crawling policy" could consist of url filters, scoring
methods, etc - basically anything configurable in Nutch could be
included in this "policy". Perhaps it could even be the new version
of non-static NutchConf ;-)
Then, if a given initial url is a known high-quality source, we
would like to apply a "favor" policy, where we e.g. add pages linked
from that url, and in doing so we give them a higher score.
Recursively, we could apply the same policy to the next-generation
pages, or perhaps only to pages belonging to the same domain. So,
in a sense, the original notion of high quality would cascade down to
other linked pages. The important aspect to note is that all
newly discovered pages would be subject to the same policy - unless
we have compelling reasons to switch the policy (from "favor" to
"default" or to "distrust"), which at that point would essentially
change the shape of the expanding frontier.
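To make this a bit more concrete, here's a rough sketch of what such a
policy could look like. None of these classes exist in Nutch - all the
names are made up - it's only meant to show the idea of a policy
travelling along with the outlinks:

// Hypothetical only - nothing like this exists in Nutch today.
public interface CrawlPolicy {
  /** Should this outlink be added to the web db at all? */
  boolean accept(String url);

  /** The score a newly discovered outlink inherits from its parent. */
  float adjustScore(float parentScore);

  /** The policy the outlink carries forward. */
  CrawlPolicy nextPolicy(String parentUrl, String outlinkUrl);
}

// A "favor" policy: boost the inherited score, keep favoring pages on the
// same host, and fall back to the default policy once we leave that host.
class FavorPolicy implements CrawlPolicy {
  private final CrawlPolicy defaultPolicy;

  FavorPolicy(CrawlPolicy defaultPolicy) { this.defaultPolicy = defaultPolicy; }

  public boolean accept(String url) { return true; }

  public float adjustScore(float parentScore) { return parentScore * 2.0f; }

  public CrawlPolicy nextPolicy(String parentUrl, String outlinkUrl) {
    return sameHost(parentUrl, outlinkUrl) ? this : defaultPolicy;
  }

  private static boolean sameHost(String a, String b) {
    try {
      return new java.net.URL(a).getHost().equalsIgnoreCase(
             new java.net.URL(b).getHost());
    } catch (java.net.MalformedURLException e) {
      return false;
    }
  }
}

The generate/updatedb step would then ask the parent page's policy what
to do with each outlink, instead of consulting one global set of
URLFilters.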
If a given initial url is a known spammer, we would like to apply a
"distrust" policy when adding pages linked from that url (e.g. not
adding them at all, or adding them with a lowered score or a
different score calculation). And recursively we could apply a
similar "distrust" policy to any pages discovered this way. We
could also change the policy along the way, if there are compelling
reasons to do so. This means that we could follow some high-quality
links from low-quality pages, without drilling down into sites that
are known to be of low quality.
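Continuing the same (purely hypothetical) sketch, a "distrust" policy
could keep following links at a heavily discounted score, and hand off
to the default policy when an outlink points at a host on a
hand-maintained list of known-good sites - the 0.1 discount and the
whitelist below are arbitrary examples:

// Hypothetical, as above.
class DistrustPolicy implements CrawlPolicy {
  private final CrawlPolicy defaultPolicy;
  private final java.util.Set goodHosts;   // hand-maintained whitelist of hosts

  DistrustPolicy(CrawlPolicy defaultPolicy, java.util.Set goodHosts) {
    this.defaultPolicy = defaultPolicy;
    this.goodHosts = goodHosts;
  }

  public boolean accept(String url) { return true; }   // or refuse to add at all

  public float adjustScore(float parentScore) {
    return parentScore * 0.1f;                          // heavily discounted
  }

  public CrawlPolicy nextPolicy(String parentUrl, String outlinkUrl) {
    // the escape hatch: a high-quality link found on a low-quality page
    return goodHosts.contains(hostOf(outlinkUrl)) ? defaultPolicy : this;
  }

  private static String hostOf(String url) {
    try { return new java.net.URL(url).getHost().toLowerCase(); }
    catch (java.net.MalformedURLException e) { return ""; }
  }
}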
Special care needs to be taken if the same page is discovered from
pages with different policies; I haven't thought about this aspect
yet... ;-)
This sounds like the TrustRank algorithm. See
http://www.vldb.org/conf/2004/RS15P3.PDF. This talks about trust
attenuation via trust dampening (reducing the trust level as you get
further from a trusted page) and trust splitting (OPIC-like approach).
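A toy illustration of the two attenuation schemes (the method names and
the example numbers are mine, not the paper's notation):

// Trust dampening vs. trust splitting, as described in the TrustRank paper.
public class TrustAttenuation {

  /** Dampening: each hop away from a seed multiplies trust by beta < 1,
      e.g. with beta = 0.85 a page one hop from a seed gets trust 0.85. */
  static float dampen(float parentTrust, float beta) {
    return beta * parentTrust;
  }

  /** Splitting: a page's trust is shared evenly among its outlinks, much
      like OPIC splits a page's cash - trust 1.0 split across 5 outlinks
      gives 0.2 to each. */
  static float split(float parentTrust, int numOutlinks) {
    return parentTrust / numOutlinks;
  }
}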
What would be the benefits of such an approach?
* the initial page + policy would together control the expanding
crawling frontier, and the policy could be defined differently for
different starting pages. I.e. in a single web database we could keep
different "collections" or "areas of interest", each with its own
policy, while still reaping the benefits of a single web db, namely
the link information.
* URLFilters could be grouped into several policies, and it would be
easy to switch between them, or edit them.
* if the crawl process realizes it ended up on a spam page, it can
switch the page policy to "distrust" (or the other way around) and
stop crawling unwanted content. From then on, the pages linked from
that page will follow the new policy. In other words, if a crawling
frontier reaches pages with known quality problems, it would be easy
to change the policy on-the-fly to avoid them or pages linked from
them, without resorting to modifications of URLFilters.
Some of the above you can do even now with URLFilters, but any
change you make now has global consequences. You may also end up with
awfully complicated rules if you try to cover all cases in one rule
set.
The approach we took (with Nutch 0.7) is to use the Page nextScore
field as a 'crawl priority' field. We apply a scoring function to
each page that takes into account the contents of the page, the
page's domain (we have a hand-selected set of known "good" domains
and sub-domains), and the page's OPIC score. This gets divided up
among the valid outlinks (after passing these through the URL
filter), and summed into the appropriate Page.nextScore entries in
the web db.
Then at crawl time we sort by nextScore, and pick a percentage of the
total unfetched links.
This gives us a pretty good depth-first crawl, where "depth" is
defined by the page content scoring function and the set of trusted
domains.
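Very roughly, the per-page calculation looks like the sketch below. The
helper names (contentScore, domainBoost, etc.) are simplified
stand-ins, not actual Nutch or Krugle code, and how the three inputs
get combined is glossed over here (shown as a simple product):

import java.util.HashMap;
import java.util.Map;

public class PrioritySketch {

  /** Split a page's crawl priority among its (already URL-filtered)
      outlinks. The returned url -> share entries get summed into
      Page.nextScore at web db update time. */
  public static Map splitPriority(float contentScore, float domainBoost,
                                  float opicScore, String[] validOutlinks) {
    float priority = contentScore * domainBoost * opicScore;
    Map contributions = new HashMap();
    if (validOutlinks.length == 0) {
      return contributions;
    }
    float share = priority / validOutlinks.length;
    for (int i = 0; i < validOutlinks.length; i++) {
      contributions.put(validOutlinks[i], new Float(share));
    }
    return contributions;
  }
}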
The four issues we've run into with this approach are:
1. Even with a pretty broad area of interest, you wind up focusing on
a subset of all domains, which then means that the max-threads-per-host
limit (for polite crawling) starts killing your efficiency.
To work around this, we've modified Nutch to order the fetch list
URLs by domain, and constrain the max # of URLs per domain based on
the total number of URLs to be fetched, and the number of threads
we're using.
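In outline, the per-domain cap looks something like this. It's a
simplified sketch of our mod, not stock Nutch: the ordering-by-domain
part is omitted, and the "fair share" heuristic shown is just
illustrative:

import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class DomainCapSketch {

  /** Walk the candidate URLs in fetch-list order and drop any URL whose
      host has already hit its cap. The cap itself is a rough guess based
      on the fetch list size and thread count. */
  public static List capPerDomain(List urls, int totalToFetch, int numThreads)
      throws MalformedURLException {
    int maxPerHost = Math.max(1, totalToFetch / (numThreads * 10)); // heuristic
    Map counts = new HashMap();      // host -> number of URLs kept so far
    List kept = new ArrayList();
    for (Iterator it = urls.iterator(); it.hasNext();) {
      String url = (String) it.next();
      String host = new URL(url).getHost();
      Integer seen = (Integer) counts.get(host);
      int n = (seen == null) ? 0 : seen.intValue();
      if (n < maxPerHost) {
        kept.add(url);
        counts.put(host, new Integer(n + 1));
      }
    }
    return kept;
  }
}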
The next step is to turn on HTTP keep-alive, which would be pretty
easy were it not for some funky issues we've run into with the http
connection pool, and the fact that the current protocol plug-in API
doesn't give us a good channel to pass back the info that the fetch
thread needs to control the keep-alive process.
2. With a vertical crawl, you seem to wind up at "Big Bob's Server"
sooner/more often than with a breadth-first crawl. Going wide means
you spend a lot more time on sites with lots of pages, and these are
typically higher-performance/better-behaved. With a vertical crawl, we seem
to hit a lot more slow/nonconforming servers, which then kill our
fetch performance.
Typical issues are things like servers sending back an endless stream
of HTTP response headers, or trickling data back for a big file.
To work around this, we've implemented support for monitoring thread
performance and interrupting threads that are taking too long.
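A stripped-down version of the watchdog idea - this is just the shape
of it, not our actual code, and real code would need proper
synchronization around the shared start times:

// Watch a set of fetcher threads and interrupt any thread that has been
// stuck on a single URL for longer than maxMillis.
public class WatchdogSketch extends Thread {
  private final Thread[] fetchers;
  private final long[] startTimes;   // each fetcher records when it began its current URL
  private final long maxMillis;

  public WatchdogSketch(Thread[] fetchers, long[] startTimes, long maxMillis) {
    this.fetchers = fetchers;
    this.startTimes = startTimes;
    this.maxMillis = maxMillis;
  }

  public void run() {
    while (true) {
      long now = System.currentTimeMillis();
      for (int i = 0; i < fetchers.length; i++) {
        if (fetchers[i].isAlive() && now - startTimes[i] > maxMillis) {
          fetchers[i].interrupt();   // give up on the slow/nonconforming server
        }
      }
      try {
        Thread.sleep(1000);          // poll once a second
      } catch (InterruptedException e) {
        return;
      }
    }
  }
}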
3. With a vertical crawl, you typically want to keep the number of
fetched URLs (per loop) at a pretty low percentage relative to the
total number of unfetched URLs in the WebDB. This helps maintain
good crawl focus.
The problem we've run into is that the time spent updating the
WebDB, once you're past about 10M pages, starts to dominate the total
crawl time.
For example, we can fetch 200K pages in an hour on one machine, but
then it takes us 2.5 hours to update a 15M page WebDB with the
results. We can do a fetch in parallel with the update from the
previous crawl, but that doesn't buy us much. The typical solution is
to increase the number of URLs fetched, to balance out the fetch vs.
update time, but this winds up wasting bandwidth when most of the
extra URLs you fetch aren't all that interesting.
We're hoping that switching to Nutch 0.8 will help solve this issue.
We're in the middle of integrating our mods from 0.7 - don't know how
hard this will be.
4. We'd like to save multiple scores for a page, for the case where
we're applying multiple scoring functions to a page. But modifying
the WebDB to support a set of scores per page, while not hurting
performance, seems tricky.
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-470-9200