Hi Andrzej,

This sounds like a good addition to the current system, IMO. It would be especially helpful for building a generic web search, or a domain-specific search where you would have an algorithm to prioritize which sites to crawl for your domain.
I would go one step further and say that we should consider storing domain-level stats, and even IP-level stats if possible. For example: how many pages do we have from each host/domain/IP (H/D/I), what is the average error rate while crawling pages from an H/D/I, how many dynamic pages does an H/D/I have, what is the average page size, what is the average response time of the H/D/I, etc.

These stats would be very useful for improving crawler efficiency as well. For example, if we know that a host/domain's error rate is very high, the scoring plugin can penalize URLs from that host/domain so they are deprioritized while crawling. Also, based on the average response time of a host/domain, we can mix an appropriate number of pages from various sites in a fetchlist so that the fetch can be completed within a certain time. Currently, we have a global property max.pages.per.host (something like that). Instead of that, let's say we input the amount of time that we want to spend in one fetch. Then, by computing the estimated response time of each site, we can mix in more pages from faster sites and fewer from slow sites.

Last, as Andrzej said, aggregated stats are useful for spam detection. Let's say you have identified a host as spam. There is a high probability that other hosts from the same domain are spam too (except for portal sites like geocities.com, of course).

Basically, what I am trying to say is that this is definitely something we should seriously consider integrating into Nutch - a big thumbs up from me :)

Regards,
-vishal.

-----Original Message-----
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED]
Sent: Tuesday, September 18, 2007 1:09 AM
To: [email protected]
Subject: Host-level stats, ranking and recrawl

Hi,

I was recently reading again some scoring-related papers, and found some interesting data in a paper by Baeza-Yates et al, "Crawling a Country: Better Strategies than Breadth-First for Web Page Ordering" (http://citeseer.ist.psu.edu/730674.html).
This paper compares various strategies for prioritizing a crawl of unfetched pages. Among others, it compares OPIC scoring with a simple strategy called "large sites first". This strategy prioritizes pages from large sites and deprioritizes pages from small / medium sites. To measure effectiveness, the authors used the value of accumulated PageRank vs. the percentage of crawled pages - the strategy that ensures the quickest ramp-up of aggregate PageRank is the best.

A bit surprisingly, they found that large-sites-first wins over OPIC:

"Breadth-first is close to the best strategies for the first 20-30% of pages, but after that it becomes less efficient. The strategies batch-pagerank, larger-sites-first and OPIC have better performance than the other strategies, with an advantage towards larger-sites-first when the desired coverage is high. These strategies can retrieve about half of the Pagerank value of their domains downloading only around 20-30% of the pages."

Nutch currently uses OPIC-like scoring for this, so most likely it suffers from the same symptoms (the authors also mention relatively poor OPIC performance at the beginning of a crawl).

Nutch doesn't currently collect any host-level statistics, so we couldn't use the other strategy even if we wanted to.

What if we added a host-level DB to Nutch? Arguments against: it's an additional data structure to maintain, which adds complexity to the system; and it's an additional step in the workflow (-> one cycle of crawling takes longer to complete). Arguments for: we could implement the above scoring method ;), plus host-level statistics are good for detecting spam sites, limiting the crawl by site size, etc.

We could start by implementing a tool to collect such statistics from CrawlDb - this should be a trivial map-reduce job, so if anyone wants to take a crack at this it would be a good exercise ... ;)

-- 
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
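
The statistics job proposed in the original message could look roughly like the sketch below. It is plain Python rather than Nutch or Hadoop code, and everything in it is an assumption made for illustration: the (url, status) record shape, the status strings, and the function names are hypothetical and do not reflect the real CrawlDb format. "Size" is taken here to be the number of URLs known for a host, which may differ from the exact definition used in the paper.

```python
from collections import defaultdict
from urllib.parse import urlsplit

FIELDS = ("total", "fetched", "unfetched", "errors")


def map_record(url, status):
    """Map step: emit (host, counters) for one hypothetical CrawlDb-like record."""
    host = urlsplit(url).hostname or ""
    counters = (
        1,
        int(status == "fetched"),
        int(status == "unfetched"),
        int(status == "error"),
    )
    return host, counters


def reduce_counts(values):
    """Reduce step: sum the per-URL counters emitted for a single host."""
    totals = [0] * len(FIELDS)
    for counters in values:
        for i, n in enumerate(counters):
            totals[i] += n
    return dict(zip(FIELDS, totals))


def host_stats(records):
    """Group the mapped records by host, then reduce each group."""
    grouped = defaultdict(list)
    for url, status in records:
        host, counters = map_record(url, status)
        grouped[host].append(counters)
    return {host: reduce_counts(values) for host, values in grouped.items()}


def large_sites_first(records, stats):
    """Order unfetched URLs so that hosts with the most known URLs come first.

    The sort is stable, so URLs from equally sized hosts keep their input order.
    """
    pending = [url for url, status in records if status == "unfetched"]
    return sorted(
        pending,
        key=lambda url: -stats[urlsplit(url).hostname or ""]["total"],
    )
```

Grouping by domain or IP instead of host would only change the key returned by map_record; the reduce step stays the same.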

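The time-budget idea from the reply (more pages from fast sites, fewer from slow ones) can be sketched in the same way. This is only an illustration under stated assumptions: hosts are fetched in parallel, the pages of one host are fetched one after another, and the average response time per host is already known from host-level stats. The function names, the default_secs fallback, and the data layout are hypothetical and are not existing Nutch properties or APIs.

```python
def per_host_quota(avg_response_secs, budget_secs):
    """Number of pages one host can serve sequentially within the time budget."""
    if avg_response_secs <= 0:
        raise ValueError("average response time must be positive")
    return int(budget_secs // avg_response_secs)


def build_fetchlist(pending, avg_response_secs, budget_secs, default_secs=1.0):
    """Take up to the per-host quota of URLs from each host.

    pending maps host -> list of candidate URLs, already sorted by score.
    avg_response_secs maps host -> average response time in seconds;
    hosts without a measurement fall back to default_secs (an assumption).
    """
    fetchlist = []
    for host, urls in pending.items():
        secs = avg_response_secs.get(host, default_secs)
        quota = per_host_quota(secs, budget_secs)
        fetchlist.extend(urls[:quota])
    return fetchlist
```

With a 4-second budget, a host that answers in 0.5 s contributes up to 8 pages while a host that answers in 4 s contributes only 1, replacing a single global per-host cap with a cap derived from each host's measured speed.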