Hi,

I was recently re-reading some scoring-related papers, and found some interesting data in a paper by Baeza-Yates et al., "Crawling a Country: Better Strategies than Breadth-First for Web Page Ordering" (http://citeseer.ist.psu.edu/730674.html).

This paper compares various strategies for prioritizing the crawl of unfetched pages. Among others, it compares OPIC scoring with a simple strategy called "large sites first", which prioritizes pages from large sites and deprioritizes pages from small and medium sites. To measure effectiveness, the authors look at accumulated PageRank vs. the percentage of pages crawled: the strategy that ramps up aggregate PageRank fastest is the best.
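To make the strategy and the metric concrete, here is a minimal Python sketch. It is not the paper's code: it assumes that "site size" means the number of known-but-unfetched pages per host, and the names large_sites_first and accumulated_pagerank are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_of(url):
    """Return the host part of a URL (empty string if it has none)."""
    return urlsplit(url).hostname or ""


def large_sites_first(unfetched_urls):
    """Order URLs so that pages from hosts with the most pending pages come first.

    Assumption: a site's "size" is its number of pending (unfetched) pages.
    """
    pending = Counter(host_of(u) for u in unfetched_urls)
    # sorted() is stable, so URLs of equally sized hosts keep their input order.
    return sorted(unfetched_urls, key=lambda u: -pending[host_of(u)])


def accumulated_pagerank(crawl_order, pagerank):
    """Fraction of total PageRank collected after each fetched page."""
    total = sum(pagerank.values())
    if total == 0:
        return [0.0] * len(crawl_order)
    running, curve = 0.0, []
    for url in crawl_order:
        running += pagerank.get(url, 0.0)
        curve.append(running / total)
    return curve
```

A strategy is better when the curve returned by accumulated_pagerank rises faster for the crawl order it produces.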

A bit surprisingly, they found that large-sites-first wins over OPIC:

"Breadth-first is close to the best strategies for the first 20-30% of pages, but after that it becomes less efficient. The strategies batch-pagerank, larger-sites-first and OPIC have better performance than the other strategies, with an advantage towards larger-sites-first when the desired coverage is high. These strategies can retrieve about half of the Pagerank value of their domains downloading only around 20-30% of the pages."

Nutch currently uses OPIC-like scoring for this, so most likely it suffers from the same symptoms (the authors also mention relatively poor performance of OPIC at the beginning of a crawl).

Nutch doesn't currently collect any host-level statistics, so we couldn't use the other strategy even if we wanted to.

What if we added a host-level DB to Nutch? Arguments against: it's an additional data structure to maintain, which adds complexity to the system, and it's an additional step in the workflow (-> one cycle of crawling takes longer to complete). Arguments for: we could implement the above scoring method ;), plus host-level statistics are good for detecting spam sites, limiting the crawl by site size, etc.

We could start by implementing a tool to collect such statistics from CrawlDb - this should be a trivial map-reduce job, so if anyone wants to take a crack at this it would be a good exercise ... ;)
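Roughly, the shape of such a job could look like the following Python sketch. It only simulates the map, shuffle and reduce steps in memory and is not written against the Nutch or Hadoop APIs. The input is assumed to be simplified (url, status) pairs rather than real CrawlDb entries, and map_record, reduce_host and run_job are hypothetical names.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def map_record(url, status):
    """Map step: emit (host, status) for one simplified CrawlDb record."""
    host = urlsplit(url).hostname or ""
    yield host, status


def reduce_host(host, statuses):
    """Reduce step: count pages per status for one host."""
    by_status = defaultdict(int)
    for status in statuses:
        by_status[status] += 1
    return host, {"total": sum(by_status.values()), "by_status": dict(by_status)}


def run_job(records):
    """In-memory stand-in for the shuffle: group map output by host, then reduce."""
    grouped = defaultdict(list)
    for url, status in records:
        for host, value in map_record(url, status):
            grouped[host].append(value)
    return dict(reduce_host(h, vals) for h, vals in grouped.items())
```

The "total" count per host is what a large-sites-first ordering would need, while the per-status breakdown is the kind of data that could help with spam detection or size limits.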

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
