Host-level stats, ranking and recrawl

Andrzej Bialecki Mon, 17 Sep 2007 15:25:56 -0700

Hi,

I was recently reading again some scoring-related papers, and found some interesting data in a paper by Baeza-Yates et al., "Crawling a Country: Better Strategies than Breadth-First for Web Page Ordering" (http://citeseer.ist.psu.edu/730674.html).

This paper compares various strategies for prioritizing a crawl of unfetched pages. Among others, it compares OPIC scoring with a simple strategy called "large sites first". This strategy prioritizes pages from large sites and deprioritizes pages from small / medium sites. To measure effectiveness, the authors used the value of accumulated PageRank vs. the percentage of crawled pages - the strategy that gives the quickest ramp-up of aggregate PageRank is the best.
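
To make the idea concrete, here is a rough Python sketch of such an ordering and of the evaluation metric. It is only an illustration, not code from the paper or from Nutch: it assumes that "site size" means the number of pending (unfetched) URLs currently known for a host, and the function names large_sites_first and pagerank_ramp_up are made up for this example.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_of(url):
    """Return the host part of a URL, or an empty string if there is none."""
    return urlsplit(url).hostname or ""


def large_sites_first(pending_urls):
    """Order pending URLs so that hosts with the most pending pages come first.

    Assumption: site size is approximated by the number of pending URLs per host.
    """
    pending_per_host = Counter(host_of(u) for u in pending_urls)
    # sorted() is stable, so URLs with equal priority keep their input order.
    return sorted(pending_urls, key=lambda u: -pending_per_host[host_of(u)])


def pagerank_ramp_up(fetch_order, pagerank):
    """Return (fraction of pages crawled, fraction of PageRank collected) pairs.

    `pagerank` maps URL -> score; it is assumed to be known in advance,
    which is only possible in an offline experiment.
    """
    total = sum(pagerank.values())
    curve, collected = [], 0.0
    for i, url in enumerate(fetch_order, start=1):
        collected += pagerank.get(url, 0.0)
        share = collected / total if total else 0.0
        curve.append((i / len(fetch_order), share))
    return curve


pending = [
    "http://small.example/a",
    "http://big.example/1",
    "http://big.example/2",
    "http://big.example/3",
    "http://mid.example/x",
    "http://mid.example/y",
]
ordered = large_sites_first(pending)
```

A strategy whose curve rises faster (more PageRank share for the same fraction of pages) would be the better one under the paper's metric.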


A bit surprisingly, they found that large-sites-first wins over OPIC:

"Breadth-first is close to the best strategies for the first 20-30% of pages, but after that it becomes less efficient. The strategies batch-pagerank, larger-sites-first and OPIC have better performance than the other strategies, with an advantage towards larger-sites-first when the desired coverage is high. These strategies can retrieve about half of the Pagerank value of their domains downloading only around 20-30% of the pages."

Nutch currently uses OPIC-like scoring for this, so most likely it suffers from the same symptoms (the authors also mention a relatively poor OPIC performance at the beginning of a crawl).

Nutch doesn't currently collect any host-level statistics, so we couldn't use the other strategy even if we wanted to.

What if we added a host-level DB to Nutch? Arguments against this: it's an additional data structure to maintain, which adds complexity to the system; and it's an additional step in the workflow (-> it takes longer to complete one cycle of crawling). Arguments for: we could implement the above scoring method ;), plus host-level statistics are good for detecting spam sites, limiting the crawl by site size, etc.

We could start by implementing a tool to collect such statistics from CrawlDb - this should be a trivial map-reduce job, so if anyone wants to take a crack at this it would be a good exercise ... ;)
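
Just to show the shape of such a job, here is a minimal in-memory sketch in Python. It is not Nutch or Hadoop code: a real tool would be a Java map-reduce job reading CrawlDb entries, whereas here the input is assumed to be plain (url, status) pairs, the status strings are only illustrative, and map_record, reduce_host and run_job are hypothetical names.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit


def map_record(url, status):
    """Map step: emit (host, status) for every record that has a host."""
    host = urlsplit(url).hostname
    if host:
        yield host, status


def reduce_host(host, statuses):
    """Reduce step: count pages per status for one host, plus a total."""
    counts = Counter(statuses)
    stats = dict(counts)
    stats["total"] = sum(counts.values())
    return host, stats


def run_job(records):
    """Tiny stand-in for the framework: map, group by key, then reduce."""
    grouped = defaultdict(list)
    for url, status in records:
        for host, value in map_record(url, status):
            grouped[host].append(value)
    return dict(reduce_host(host, values) for host, values in grouped.items())


# Illustrative input; a real job would read these from CrawlDb.
records = [
    ("http://a.example/1", "db_fetched"),
    ("http://a.example/2", "db_unfetched"),
    ("http://b.example/", "db_fetched"),
]
host_stats = run_job(records)
```

The per-host totals produced this way are the kind of input a large-sites-first ordering, or a per-site crawl limit, would need.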


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
