Hi,

On 9/17/07, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I was recently reading again some scoring-related papers, and found some
> interesting data in a paper by Baeza-Yates et al, "Crawling a Country:
> Better Strategies than Breadth-First for Web Page Ordering"
> (http://citeseer.ist.psu.edu/730674.html).
>
> This paper compares various strategies for prioritizing a crawl of
> unfetched pages. Among others, it compared the OPIC scoring and a simple
> strategy which is called "large sites first". This strategy prioritizes
> pages from large sites and deprioritizes pages from small / medium
> sites. In order to measure the effectiveness the authors used the value
> of accumulated PageRank vs. the percentage of crawled pages - the
> strategy that ensures quick ramp-up of aggregate pagerank is the best.
>
> A bit surprisingly, they found that large-sites-first wins over OPIC:
>
> "Breadth-first is close to the best strategies for the first 20-30% of
> pages, but after that it becomes less efficient.
> The strategies batch-pagerank, larger-sites-first and OPIC have better
> performance than the other strategies, with an advantage towards
> larger-sites-first when the desired coverage is high. These strategies
> can retrieve about half of the Pagerank value of their domains
> downloading only around 20-30% of the pages."
>
> Nutch currently uses OPIC-like scoring for this, so most likely it
> suffers from the same symptoms (the authors also mention a relatively
> poor OPIC performance at the beginning of a crawl).
>
> Nutch doesn't collect at the moment any host-level statistics, so we
> couldn't use the other strategy even if we wanted.
>
> What if we added a host-level DB to Nutch? Arguments against this: it's
> an additional data structure to maintain, and this adds complexity to
> the system; it's an additional step in the workflow (-> it takes longer
> time to complete one cycle of crawling).
> Arguments for are the
> following: we could implement the above scoring method ;), plus the
> host-level statistics are good for detecting spam sites, limiting the
> crawl by site size, etc.
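
To make the "large sites first" idea concrete, here is a minimal sketch of the ordering. It is not Nutch code. It assumes that "site size" can be approximated by the number of pending (unfetched) URLs currently known per host, and `host_of` and `large_sites_first` are hypothetical names used only for illustration:

```python
from collections import Counter
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercased host part of a URL, or '' if it has none."""
    return (urlsplit(url).hostname or "").lower()


def large_sites_first(unfetched_urls):
    """Order unfetched URLs so hosts with the most pending pages come first.

    Assumption: the pending-URL count per host stands in for site size.
    """
    pending = Counter(host_of(u) for u in unfetched_urls)
    # Descending pending count first, then host and URL for a repeatable order.
    return sorted(
        unfetched_urls,
        key=lambda u: (-pending[host_of(u)], host_of(u), u),
    )


urls = [
    "http://small.example/a",
    "http://big.example/1",
    "http://big.example/2",
    "http://big.example/3",
    "http://mid.example/x",
    "http://mid.example/y",
]
ordered = large_sites_first(urls)
```

A real generator step would also have to respect per-host politeness limits, which this sketch ignores.
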
Another +1. We definitely need domain-level statistics anyway, so being
able to implement large-sites-first is a nice bonus, I think :)

> We could start by implementing a tool to collect such statistics from
> CrawlDb - this should be a trivial map-reduce job, so if anyone wants to
> take a crack at this it would be a good exercise ... ;)
>
> --
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _ __________________________________
> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> ___|||__|| \| || | Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com

--
Doğacan Güney
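
P.S. A rough sketch of what the statistics job could look like, written as plain in-memory Python rather than against the Hadoop or Nutch APIs. The record layout (a URL plus a status string) and the names `map_record`, `reduce_host` and `host_stats` are assumptions made for illustration only; a real job would read the actual CrawlDb entries instead:

```python
from collections import defaultdict
from urllib.parse import urlsplit


def map_record(url, status):
    """Map step: emit one (host, status) pair per CrawlDb-like record."""
    host = (urlsplit(url).hostname or "").lower()
    yield host, status


def reduce_host(host, statuses):
    """Reduce step: count records per status for one host, plus a total."""
    counts = defaultdict(int)
    for status in statuses:
        counts[status] += 1
        counts["total"] += 1
    return host, dict(counts)


def host_stats(records):
    """Run map, group by key (the shuffle), then reduce, all in memory."""
    grouped = defaultdict(list)
    for url, status in records:
        for host, value in map_record(url, status):
            grouped[host].append(value)
    return dict(reduce_host(h, v) for h, v in grouped.items())


# Hypothetical records: (url, status) pairs with illustrative status names.
records = [
    ("http://a.example/1", "fetched"),
    ("http://a.example/2", "fetched"),
    ("http://a.example/3", "unfetched"),
    ("http://b.example/", "unfetched"),
]
stats = host_stats(records)
```

The per-host totals are what a large-sites-first ordering would need, and the per-status counts are a starting point for things like limiting the crawl by site size.
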
