Re: Host-level stats, ranking and recrawl

Chris Schneider Wed, 19 Sep 2007 09:03:06 -0700

Andrzej, et al.,

At 9:38 PM +0200 9/17/07, Andrzej Bialecki wrote:
>I was recently reading again some scoring-related papers, and found some 
>interesting data in a paper by Baeza-Yates et al, "Crawling a Country: Better 
>Strategies than Breadth-First for Web Page Ordering" 
>(http://citeseer.ist.psu.edu/730674.html).
>
>This paper compares various strategies for prioritizing a crawl of unfetched 
>pages. Among others, it compared the OPIC scoring and a simple strategy which 
>is called "large sites first". This strategy prioritizes pages from large 
>sites and deprioritizes pages from small / medium sites. In order to measure 
>the effectiveness the authors used the value of accumulated PageRank vs. the 
>percentage of crawled pages - the strategy that ensures quick ramp-up of 
>aggregate pagerank is the best.
>
>A bit surprisingly, they found that large-sites-first wins over OPIC:
>
>"Breadth-first is close to the best strategies for the first 20-30% of pages, 
>but after that it becomes less efficient.
> The strategies batch-pagerank, larger-sites-first and OPIC have better 
> performance than the other strategies, with an advantage towards 
> larger-sites-first when the desired coverage is high. These strategies can 
> retrieve about half of the Pagerank value of their domains downloading only 
> around 20-30% of the pages."
>
>Nutch currently uses OPIC-like scoring for this, so most likely it suffers 
>from the same symptoms (the authors also mention a relatively poor OPIC 
>performance at the beginning of a crawl).
>
>Nutch doesn't currently collect any host-level statistics, so we couldn't 
>use the other strategy even if we wanted to.
>
>What if we added a host-level DB to Nutch? Arguments against: it's an 
>additional data structure to maintain, which adds complexity to the system, 
>and it's an additional step in the workflow (-> it takes longer to complete 
>one cycle of crawling). Arguments for: we could implement 
>the above scoring method ;), plus the host-level statistics are good for 
>detecting spam sites, limiting the crawl by site size, etc.
>
>We could start by implementing a tool to collect such statistics from CrawlDb 
>- this should be a trivial map-reduce job, so if anyone wants to take a crack 
>at this it would be a good exercise ... ;)
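
To make the quoted idea concrete, here is a minimal sketch of per-host counting followed by a large-sites-first ordering. It is plain Python rather than a real Nutch map-reduce job. The (url, status) pairs, the "fetched"/"unfetched" labels, and the function names are all assumptions made for illustration, not the CrawlDb format. It also assumes "site size" means the number of known but unfetched pages on a host, which is only one possible reading of the paper.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_of(url):
    """Return the lower-cased host name of a URL ('' if it has none)."""
    return (urlsplit(url).hostname or "").lower()


def host_stats(crawldb):
    """Count pages per host and per status.

    `crawldb` is a list of (url, status) pairs; this stands in for the
    real CrawlDb and is purely hypothetical.
    """
    stats = {}
    for url, status in crawldb:
        stats.setdefault(host_of(url), Counter())[status] += 1
    return stats


def large_sites_first(crawldb):
    """Order unfetched URLs so hosts with the most pending pages come first.

    Python's sort is stable, so URLs from equally sized hosts keep their
    original relative order.
    """
    stats = host_stats(crawldb)
    pending = [url for url, status in crawldb if status == "unfetched"]
    return sorted(pending, key=lambda u: -stats[host_of(u)]["unfetched"])
```

In a real job the counting step would be the map-reduce part (map each entry to its host, reduce by summing), and the ordering step would feed the generator's scoring.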


Stefan Groschupf developed a tool called DomainStats (with a little help from 
me) that collects such domain-level data from the crawl results (both crawldb 
and segment data). The tool can select for various field and metadata 
conditions when counting pages, so we use it to count both the pages crawled 
in each domain and the subset of those pages that meet a "technical" 
threshold. We then use the results to create a "white list" of the most 
technical domains on which to focus our next crawl. Domains and sub-domains 
are counted separately, so you get separate counts for www.apache.org, 
apache.org, and org.
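
As a rough illustration of the counting described above (not the DomainStats code itself), the sketch below credits every page to its host and to each parent domain, and keeps a second tally for pages that pass a predicate. The (url, metadata) pairs, the `is_technical` predicate, and the `white_list` thresholds are hypothetical stand-ins.

```python
from collections import Counter
from urllib.parse import urlsplit


def domain_levels(url):
    """Return the host and each of its parent domains, most specific first."""
    host = (urlsplit(url).hostname or "").lower()
    labels = host.split(".") if host else []
    return [".".join(labels[i:]) for i in range(len(labels))]


def domain_stats(pages, is_technical):
    """Count all pages and predicate-matching pages at every domain level.

    `pages` is an iterable of (url, metadata) pairs and `is_technical` is
    any callable on the metadata; both are assumptions for this sketch.
    """
    total, technical = Counter(), Counter()
    for url, meta in pages:
        matches = is_technical(meta)
        for domain in domain_levels(url):
            total[domain] += 1
            if matches:
                technical[domain] += 1
    return total, technical


def white_list(total, technical, min_pages, min_ratio):
    """Pick domains with enough pages and a high enough matching ratio."""
    return sorted(
        d for d in total
        if total[d] >= min_pages and technical[d] / total[d] >= min_ratio
    )
```

Because every page is counted once per level, the count for apache.org includes pages from www.apache.org and any other sub-domain, and the count for org includes all of them.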

Is there a Jira ticket open for this? If not, I could create one and submit a 
patch. We're currently using a Nutch code base that originated around 417928, 
but I think the tool is pretty self-contained.

Let me know,

- Schmed
-- 
----------------------------
Chris Schneider
Krugle, Inc.
http://www.krugle.com
[EMAIL PROTECTED]
----------------------------
