Hi,
I was recently re-reading some scoring-related papers, and found some
interesting data in a paper by Baeza-Yates et al, "Crawling a Country:
Better Strategies than Breadth-First for Web Page Ordering"
(http://citeseer.ist.psu.edu/730674.html).
This paper compares various strategies for prioritizing a crawl of
unfetched pages. Among others, it compares OPIC scoring with a simple
strategy called "large sites first", which prioritizes pages from large
sites and deprioritizes pages from small and medium sites. To measure
effectiveness, the authors plot accumulated PageRank against the
percentage of pages crawled: the strategy whose aggregate PageRank
ramps up fastest is the best.
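
To make the metric concrete, here is a minimal sketch of how such a
curve could be computed. It assumes we already have a PageRank value
for every page (computed offline on the full graph) and the order in
which a strategy fetched the pages; the function name and the toy
values are made up for illustration and are not taken from the paper.

```python
from typing import Dict, List, Sequence


def pagerank_coverage(crawl_order: Sequence[str],
                      pagerank: Dict[str, float]) -> List[float]:
    """For each prefix of crawl_order, return the fraction of the total
    PageRank mass collected so far."""
    total = sum(pagerank.values())
    if total <= 0:
        raise ValueError("total PageRank must be positive")
    collected = 0.0
    curve = []
    for url in crawl_order:
        # Pages without a known PageRank contribute nothing.
        collected += pagerank.get(url, 0.0)
        curve.append(collected / total)
    return curve


# Toy example (hypothetical values): two orderings of the same pages.
toy_pagerank = {"a": 0.5, "b": 0.25, "c": 0.25}
good_order = pagerank_coverage(["a", "b", "c"], toy_pagerank)
poor_order = pagerank_coverage(["c", "b", "a"], toy_pagerank)
```

A strategy is better the earlier its curve rises, i.e. good_order above
dominates poor_order at every point before the end of the crawl.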
A bit surprisingly, they found that large-sites-first wins over OPIC:
"Breadth-first is close to the best strategies for the first 20-30% of
pages, but after that it becomes less efficient.
The strategies batch-pagerank, larger-sites-first and OPIC have better
performance than the other strategies, with an advantage towards
larger-sites-first when the desired coverage is high. These strategies
can retrieve about half of the Pagerank value of their domains
downloading only around 20-30% of the pages."
Nutch currently uses OPIC-like scoring for this, so most likely it
suffers from the same symptoms (the authors also mention relatively
poor OPIC performance at the beginning of a crawl).
Nutch doesn't currently collect any host-level statistics, so we
couldn't use the large-sites-first strategy even if we wanted to.
What if we added a host-level DB to Nutch? Arguments against: it's an
additional data structure to maintain, which adds complexity to the
system, and it's an additional step in the workflow (so one crawl cycle
takes longer to complete). Arguments for: we could implement the above
scoring method ;), plus host-level statistics are useful for detecting
spam sites, limiting the crawl by site size, etc.
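
As a rough illustration of what the scoring method would need, here is
a sketch that orders unfetched URLs given per-host page counts. It
assumes "site size" means the number of known URLs per host, which is
my simplification; the function names and the host_sizes mapping are
hypothetical and not part of Nutch.

```python
from typing import Dict, Iterable, List
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Lower-cased host name of a URL, or '' if it has none."""
    return (urlsplit(url).hostname or "").lower()


def large_sites_first(unfetched: Iterable[str],
                      host_sizes: Dict[str, int]) -> List[str]:
    """Order URLs so that pages from the largest hosts come first.
    Hosts missing from host_sizes are treated as size 0; ties are
    broken by URL so the ordering is deterministic."""
    return sorted(unfetched,
                  key=lambda u: (-host_sizes.get(host_of(u), 0), u))


# Hypothetical host sizes, as a host-level DB might provide them.
sizes = {"big.example": 100, "small.example": 2}
ordered = large_sites_first(
    ["http://small.example/a",
     "http://big.example/x",
     "http://big.example/b"],
    sizes,
)
```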
We could start by implementing a tool to collect such statistics from
CrawlDb - this should be a trivial map-reduce job, so if anyone wants to
take a crack at this, it would be a good exercise ... ;)
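
To show the shape of such a job, here is a small in-memory sketch of
the map and reduce steps. It is plain Python, not Nutch or Hadoop code:
it assumes CrawlDb entries can be read as (url, status) pairs, and the
function names and the "total" key are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple
from urllib.parse import urlsplit


def map_entry(url: str, status: str) -> Iterator[Tuple[str, Tuple[str, int]]]:
    """Map step: emit (host, (status, 1)) for one CrawlDb-like entry."""
    host = (urlsplit(url).hostname or "").lower()
    if host:  # skip entries without a usable host
        yield host, (status, 1)


def reduce_host(values: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Reduce step: sum the counts per status, plus an overall total."""
    counts: Dict[str, int] = defaultdict(int)
    for status, n in values:
        counts[status] += n
        counts["total"] += n
    return dict(counts)


def host_stats(entries: Iterable[Tuple[str, str]]) -> Dict[str, Dict[str, int]]:
    """Run map, group by host (the shuffle), then reduce, all in memory."""
    grouped: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    for url, status in entries:
        for host, value in map_entry(url, status):
            grouped[host].append(value)
    return {host: reduce_host(values) for host, values in grouped.items()}


# Hypothetical entries; the status strings are only labels here.
stats = host_stats([
    ("http://big.example/a", "db_fetched"),
    ("http://big.example/b", "db_unfetched"),
    ("http://big.example/c", "db_unfetched"),
    ("http://small.example/", "db_fetched"),
])
```

The per-host totals are what a large-sites-first ordering would need,
and the per-status breakdown is a starting point for spam detection or
site-size limits.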
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com