Andrzej, et al.,

At 9:38 PM +0200 9/17/07, Andrzej Bialecki wrote:
>I was recently reading again some scoring-related papers, and found some
>interesting data in a paper by Baeza-Yates et al, "Crawling a Country: Better
>Strategies than Breadth-First for Web Page Ordering"
>(http://citeseer.ist.psu.edu/730674.html).
>
>This paper compares various strategies for prioritizing a crawl of unfetched
>pages. Among others, it compared the OPIC scoring and a simple strategy which
>is called "large sites first". This strategy prioritizes pages from large
>sites and deprioritizes pages from small / medium sites. In order to measure
>the effectiveness the authors used the value of accumulated PageRank vs. the
>percentage of crawled pages - the strategy that ensures quick ramp-up of
>aggregate pagerank is the best.
>
>A bit surprisingly, they found that large-sites-first wins over OPIC:
>
>"Breadth-first is close to the best strategies for the first 20-30% of pages,
>but after that it becomes less efficient. The strategies batch-pagerank,
>larger-sites-first and OPIC have better performance than the other
>strategies, with an advantage towards larger-sites-first when the desired
>coverage is high. These strategies can retrieve about half of the Pagerank
>value of their domains downloading only around 20-30% of the pages."
>
>Nutch currently uses OPIC-like scoring for this, so most likely it suffers
>from the same symptoms (the authors also mention a relatively poor OPIC
>performance at the beginning of a crawl).
>
>Nutch doesn't collect at the moment any host-level statistics, so we couldn't
>use the other strategy even if we wanted.
>
>What if we added a host-level DB to Nutch? Arguments against this: it's an
>additional data structure to maintain, and this adds complexity to the system;
>it's an additional step in the workflow (-> it takes longer time to complete
>one cycle of crawling). Arguments for are the following: we could implement
>the above scoring method ;), plus the host-level statistics are good for
>detecting spam sites, limiting the crawl by site size, etc.
>
>We could start by implementing a tool to collect such statistics from CrawlDb
>- this should be a trivial map-reduce job, so if anyone wants to take a crack
>at this it would be a good exercise ... ;)

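For illustration only, the per-host count and the "large sites first" ordering described above can be sketched in a few lines of plain Python. This is not Nutch code and does not read a real CrawlDb: the function names (`host_of`, `count_pages_per_host`, `large_sites_first`) and the example URLs are hypothetical, and "site size" is assumed here to mean the number of known URLs per host, which may differ from the measure used in the paper.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_of(url):
    """Return the host name of a URL, or '' if none can be parsed."""
    return urlsplit(url).hostname or ""


def count_pages_per_host(urls):
    """Equivalent of a map step emitting (host, 1) and a reduce step summing."""
    return Counter(host_of(u) for u in urls)


def large_sites_first(unfetched, host_counts):
    """Order unfetched URLs so that those on larger hosts come first."""
    # sorted() is stable, so URLs on equally sized hosts keep their input order.
    return sorted(unfetched, key=lambda u: -host_counts.get(host_of(u), 0))


# Hypothetical stand-in for the URLs already known to the crawl.
known = [
    "http://big.example/a",
    "http://big.example/b",
    "http://big.example/c",
    "http://small.example/x",
]
counts = count_pages_per_host(known)
queue = large_sites_first(
    ["http://small.example/y", "http://big.example/d"], counts
)
print(counts)
print(queue)
```

In a real map-reduce job the counting would be split across mappers and reducers keyed by host; the sketch collapses both steps into one `Counter` for brevity.
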
Stefan Groschupf developed a tool (with a little help from me) called DomainStats that collects such domain-level data from the crawl results (both crawldb and segment data). We use it to count both pages crawled in each domain and pages crawled that meet a "technical" threshold, since the tool can be used to select for various field and metadata conditions when counting pages. We use the results to create a "white list" of the most technical domains in which to focus our next crawl. Domains and sub-domains are counted separately, so you get separate counts for www.apache.org, apache.org, and org.

Is there a Jira ticket open for this? If not, I could create one and submit a patch. We're currently using a Nutch code base that originated around 417928, but I think this is pretty self-contained.

Let me know,

- Schmed
--
----------------------------
Chris Schneider
Krugle, Inc.
http://www.krugle.com
[EMAIL PROTECTED]
----------------------------
