Hi,
I've been toying with the following idea, which is an extension of the
existing URLFilter mechanism and the concept of a "crawl frontier".
Let's suppose we have several initial seed URLs, each of a different
subjective quality. We would like to crawl these and expand the
"crawling frontier" using outlinks. However, we don't want to do this
uniformly for every initial URL, but rather propagate a certain
"crawling policy" through the expanding trees of linked pages. This
"crawling policy" could consist of URL filters, scoring methods,
etc. - basically anything configurable in Nutch could be included in
it. Perhaps it could even be the new version of non-static NutchConf ;-)
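To make this more concrete, here's a rough sketch of what such a
policy could look like as a Java interface (all names below are made
up for illustration - nothing like this exists in Nutch today):

public interface CrawlPolicy {

  /** Unique ID, stored with each page so the policy can be looked
      up again later. */
  String getId();

  /** Accept or reject a newly discovered URL (the role URLFilters
      play today). */
  boolean accept(String url);

  /** Calculate the initial score of an outlink, given the score of
      the page that links to it. */
  float scoreOutlink(String fromUrl, String toUrl, float parentScore);

  /** Decide which policy an outlink should inherit - usually "this",
      but possibly a different one if there are compelling reasons
      to switch. */
  CrawlPolicy successor(String fromUrl, String toUrl);
}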
Then, if a given initial URL is a known high-quality source, we would
like to apply a "favor" policy, where we e.g. add pages linked from
that URL and, in doing so, give them a higher score. Recursively, we
could apply the same policy to the next generation of pages, or
perhaps only to pages belonging to the same domain. So, in a sense,
the original notion of high quality would cascade down to other
linked pages. The important aspect to note is that all newly
discovered pages would be subject to the same policy - unless we have
compelling reasons to switch the policy (from "favor" to "default" or
to "distrust"), which at that point would essentially change the
shape of the expanding frontier.
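Such a "favor" policy could look roughly like this (just a sketch -
the boost factor and the same-host test are arbitrary assumptions,
and PolicyRegistry is a hypothetical lookup of policies by ID,
sketched at the end of this mail):

import java.net.URL;

public class FavorPolicy implements CrawlPolicy {

  private static final float BOOST = 2.0f;

  public String getId() { return "favor"; }

  public boolean accept(String url) {
    // a trusted source - take everything (or delegate to a set of
    // URLFilters associated with this policy)
    return true;
  }

  public float scoreOutlink(String fromUrl, String toUrl, float parentScore) {
    return parentScore * BOOST; // high quality cascades down
  }

  public CrawlPolicy successor(String fromUrl, String toUrl) {
    try {
      // keep favoring pages on the same host, revert to the default
      // policy once the frontier leaves it
      String fromHost = new URL(fromUrl).getHost();
      String toHost = new URL(toUrl).getHost();
      return fromHost.equals(toHost) ? this : PolicyRegistry.get("default");
    } catch (Exception e) {
      return PolicyRegistry.get("default");
    }
  }
}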
If a given initial URL is a known spammer, we would like to apply a
"distrust" policy when adding pages linked from that URL (e.g. adding
or not adding them; if adding, then lowering their score or applying
a different score calculation). Recursively, we could apply a similar
"distrust" policy to any pages discovered this way. We could also
change the policy along the way, if there are compelling reasons to
do so. This means that we could follow some high-quality links from
low-quality pages, without drilling down into sites which are known
to be of low quality.
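A "distrust" policy would then be the mirror image - e.g. (again,
purely illustrative, with an arbitrary penalty factor):

public class DistrustPolicy implements CrawlPolicy {

  public String getId() { return "distrust"; }

  public boolean accept(String url) {
    // or reject outright, depending on how harsh we want to be
    return true;
  }

  public float scoreOutlink(String fromUrl, String toUrl, float parentScore) {
    return parentScore * 0.1f; // low quality cascades down, too
  }

  public CrawlPolicy successor(String fromUrl, String toUrl) {
    try {
      // an off-site link may deserve normal treatment again - this is
      // what lets us follow high-quality links from low-quality pages
      String fromHost = new java.net.URL(fromUrl).getHost();
      String toHost = new java.net.URL(toUrl).getHost();
      return fromHost.equals(toHost) ? this : PolicyRegistry.get("default");
    } catch (Exception e) {
      return this; // malformed URL - stay suspicious
    }
  }
}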
Special care needs to be taken if the same page is discovered from
pages with different policies - I haven't thought about this aspect
yet... ;-)
What would be the benefits of such an approach?
* the initial page + policy together would control the expanding
crawling frontier, and they could be defined differently for
different starting pages. I.e. in a single web database we could keep
different "collections" or "areas of interest" with differently
specified policies, while still reaping the benefits of a single web
db, namely the link information.
* URLFilters could be grouped into several policies, and it would be
easy to switch between them or edit them.
* if the crawl process realizes it has ended up on a spam page, it
can switch the page's policy to "distrust" (or the other way around)
and stop crawling unwanted content. From then on, the pages linked
from that page will follow the new policy. In other words, if a
crawling frontier reaches pages with known quality problems, it would
be easy to change the policy on-the-fly to avoid them, or pages
linked from them, without resorting to modifications of URLFilters.
Some of the above you can do even now with URLFilters, but any change
you make now has global consequences. You may also end up with
awfully complicated rules if you try to cover all cases in a single
rule set.
How to implement it? Surprisingly, I think it's very simple - just
adding a CrawlDatum.policyId field would suffice, assuming we have a
means to store and retrieve these policies by ID; we would then
instantiate the right policy and call the appropriate methods
wherever we use the URLFilters and do the score calculations today.
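Sketched in code (against no particular version of the sources - the
policyId field with its accessors, and the PolicyRegistry itself, are
the hypothetical additions):

import java.util.HashMap;
import java.util.Map;

public class PolicyRegistry {

  // e.g. populated from configuration at startup
  private static final Map<String, CrawlPolicy> POLICIES =
      new HashMap<String, CrawlPolicy>();

  public static void register(CrawlPolicy policy) {
    POLICIES.put(policy.getId(), policy);
  }

  public static CrawlPolicy get(String id) {
    CrawlPolicy policy = POLICIES.get(id);
    return policy != null ? policy : POLICIES.get("default");
  }
}

// ... and wherever we run the URLFilters and score calculations
// today, roughly (datum being the CrawlDatum of the page whose
// outlinks we are processing):
public void addOutlink(String fromUrl, String toUrl, CrawlDatum datum) {
  CrawlPolicy policy = PolicyRegistry.get(datum.getPolicyId());
  if (policy.accept(toUrl)) {
    CrawlDatum child = new CrawlDatum();
    child.setScore(policy.scoreOutlink(fromUrl, toUrl, datum.getScore()));
    child.setPolicyId(policy.successor(fromUrl, toUrl).getId());
    // ... add child to the web db ...
  }
}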
Any comments?
--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \| || |    Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com