Re: Host-level stats, ranking and recrawl

Doğacan Güney Tue, 18 Sep 2007 12:44:00 -0700

Hi,

On 9/17/07, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I was recently rereading some scoring-related papers and found some
> interesting data in a paper by Baeza-Yates et al., "Crawling a Country:
> Better Strategies than Breadth-First for Web Page Ordering"
> (http://citeseer.ist.psu.edu/730674.html).
>
> This paper compares various strategies for prioritizing a crawl of
> unfetched pages. Among others, it compared the OPIC scoring and a simple
> strategy which is called "large sites first". This strategy prioritizes
> pages from large sites and deprioritizes pages from small / medium
> sites. To measure effectiveness, the authors compared accumulated
> PageRank against the percentage of pages crawled: the strategy whose
> aggregate PageRank ramps up fastest is the best.
>
> A bit surprisingly, they found that large-sites-first wins over OPIC:
>
> "Breadth-first is close to the best strategies for the first 20-30% of
> pages, but after that it becomes less efficient.
>   The strategies batch-pagerank, larger-sites-first and OPIC have better
> performance than the other strategies, with an advantage towards
> larger-sites-first when the desired coverage is high. These strategies
> can retrieve about half of the Pagerank value of their domains
> downloading only around 20-30% of the pages."
>
> Nutch currently uses OPIC-like scoring for this, so most likely it
> suffers from the same symptoms (the authors also mention a relatively
> poor OPIC performance at the beginning of a crawl).
>
> Nutch currently doesn't collect any host-level statistics, so we
> couldn't use the other strategy even if we wanted to.
>
> What if we added a host-level DB to Nutch? Arguments against: it's an
> additional data structure to maintain, which adds complexity to the
> system, and it's an additional step in the workflow (so one crawl cycle
> takes longer to complete). Arguments for: we could implement the above
> scoring method ;), and host-level statistics are useful for detecting
> spam sites, limiting the crawl by site size, etc.


Another +1. We definitely need domain-level statistics anyway, so
being able to implement large-sites-first is a nice bonus, I think :)
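
To make the large-sites-first idea concrete, here is a rough sketch in
plain Python (not Nutch code). It assumes that "site size" means the
number of known-but-unfetched URLs per host, which is only one possible
reading; the function names host_of and large_sites_first are made up
for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercased host part of a URL ('' if it has none)."""
    return (urlsplit(url).hostname or "").lower()


def large_sites_first(unfetched_urls):
    """Order URLs so that hosts with the most pending pages come first.

    Assumption: a host's "size" is its number of pending (unfetched) URLs.
    Ties between hosts are broken alphabetically by host name.
    """
    pending = Counter(host_of(u) for u in unfetched_urls)
    # sorted() is stable, so URLs of the same host keep their input order.
    return sorted(
        unfetched_urls,
        key=lambda u: (-pending[host_of(u)], host_of(u)),
    )
```

In a real crawl the per-host counts would come from a host-level DB
rather than from the fetch list itself, which is exactly the data we
don't have today.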

>
> We could start by implementing a tool to collect such statistics from
> CrawlDb - this should be a trivial map-reduce job, so if anyone wants to
> take a crack at this it would be a good exercise ... ;)
>
> --
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
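
The statistics job suggested above could look roughly like this. It is
an in-memory sketch of the map / shuffle / reduce steps in plain Python,
not a Hadoop or Nutch job. The (url, status) input records and the
status strings are hypothetical stand-ins for CrawlDb entries, and all
function names are made up for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def map_record(url, status):
    """Map step: emit one (host, status) pair per CrawlDb-like record."""
    host = (urlsplit(url).hostname or "").lower()
    yield host, status


def reduce_host(host, statuses):
    """Reduce step: count one host's records, in total and per status."""
    counts = defaultdict(int)
    for status in statuses:
        counts[status] += 1
    return host, {"total": sum(counts.values()), "by_status": dict(counts)}


def host_stats(records):
    """Run map, group-by-key (the shuffle) and reduce in memory."""
    grouped = defaultdict(list)
    for url, status in records:
        for host, value in map_record(url, status):
            grouped[host].append(value)
    return dict(reduce_host(h, v) for h, v in grouped.items())
```

For example, host_stats([("http://a.example/1", "fetched"),
("http://a.example/2", "unfetched")]) gives one entry for a.example
with a total of 2 and one record per status.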


-- 
Doğacan Güney
