RE: Host-level stats, ranking and recrawl

Vishal Shah Tue, 18 Sep 2007 03:11:09 -0700

Hi Andrzej,

  This sounds like a good addition to the current system IMO. It would
especially be helpful for building a generic web search or for building a
domain-specific search where you would have an algorithm to prioritize which
sites to crawl for your domain.


  I would go one step further and say that we should consider storing
domain-level stats and even IP-level stats if possible. For example: how
many pages do we have from each host/domain/IP (H/D/I), what is the avg.
error rate while crawling pages from an H/D/I, how many dynamic pages
does an H/D/I have, what is the avg. page size, what is the avg. response
time from the H/D/I, etc.
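
As a rough illustration of the kind of record this implies, here is a
minimal Python sketch that keeps running totals per host. The names
(GroupStats, collect_host_stats) and the input tuple layout are
hypothetical, not Nutch APIs, and treating any URL with a query string
as "dynamic" is only an assumption. The same structure could be keyed by
domain or IP instead of host.

```python
from collections import defaultdict
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass
class GroupStats:
    """Running totals for one host (could equally be a domain or an IP)."""
    pages: int = 0
    errors: int = 0
    dynamic_pages: int = 0
    total_bytes: int = 0
    total_response_ms: float = 0.0

    def add(self, url, ok, size_bytes, response_ms):
        self.pages += 1
        self.errors += 0 if ok else 1
        # Assumption: a non-empty query string marks a page as dynamic.
        self.dynamic_pages += 1 if urlsplit(url).query else 0
        self.total_bytes += size_bytes
        self.total_response_ms += response_ms

    @property
    def error_rate(self):
        return self.errors / self.pages if self.pages else 0.0

    @property
    def avg_size(self):
        return self.total_bytes / self.pages if self.pages else 0.0

    @property
    def avg_response_ms(self):
        return self.total_response_ms / self.pages if self.pages else 0.0


def collect_host_stats(fetch_records):
    """fetch_records: iterable of (url, ok, size_bytes, response_ms)."""
    stats = defaultdict(GroupStats)
    for url, ok, size_bytes, response_ms in fetch_records:
        stats[urlsplit(url).hostname].add(url, ok, size_bytes, response_ms)
    return dict(stats)
```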

  These stats would also be very useful for improving crawler efficiency.
For example, if we know that a host/domain's error rate is very high, the
scoring plugin can penalize URLs from that host/domain so they are
deprioritized while crawling.
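
One possible shape for such a penalty, sketched in Python rather than as
an actual Nutch scoring plugin: scale the URL's score by (1 - error
rate), with a floor so that a host is never starved completely. The
function name and the min_factor parameter are assumptions made for
illustration.

```python
def penalized_score(base_score, error_rate, min_factor=0.1):
    """Scale a URL's score down in proportion to its host's error rate.

    min_factor is a hypothetical floor: even a host with a 100% error
    rate keeps a small score, so it can be retried occasionally.
    """
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0 and 1")
    factor = max(min_factor, 1.0 - error_rate)
    return base_score * factor
```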

Also, based on the avg. response time from a host/domain, we can mix an
appropriate number of pages from various sites in a fetchlist so that the
fetch can be completed in a certain time. Currently, we have a global
property max.pages.per.host (something like that). Instead of that, let's
say we input the amount of time that we want to spend in one fetch. Then,
by estimating the response time of each site, we can mix in more pages
from faster sites and fewer from slow sites.
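
A minimal sketch of that idea in Python, under a simplifying assumption:
pages from one host are fetched one after another (hosts in parallel),
so a host can contribute about budget / (avg. response time + crawl
delay) pages. The function and parameter names are hypothetical.

```python
def pages_per_host(avg_response_s, time_budget_s, crawl_delay_s=0.0):
    """Return a per-host page quota that should fit in the time budget.

    avg_response_s: dict mapping host -> estimated response time (seconds).
    Assumes sequential fetching within a host and parallel fetching
    across hosts, which is a simplification of a real fetcher.
    """
    quotas = {}
    for host, response_s in avg_response_s.items():
        per_page_s = response_s + crawl_delay_s
        if per_page_s <= 0:
            raise ValueError(f"no usable time estimate for {host}")
        quotas[host] = int(time_budget_s // per_page_s)
    return quotas
```

With a 60 second budget and a 1 second delay, a host answering in 0.5 s
would get 40 pages and a host answering in 4 s would get 12.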

Last, as Andrzej said - aggregated stats are useful for spam detection.
Let's say you identified a host as spam. There is a high probability that
other hosts from the same domain are spam (except for portal sites like
geocities.com of course).

Basically, what I am trying to say is that this is definitely something we
should seriously consider integrating inside Nutch - a big thumbs up from me
:)

Regards,

-vishal.


-----Original Message-----
From: Andrzej Bialecki [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, September 18, 2007 1:09 AM
To: [email protected]
Subject: Host-level stats, ranking and recrawl

Hi,

I was recently re-reading some scoring-related papers, and found some
interesting data in a paper by Baeza-Yates et al., "Crawling a Country:
Better Strategies than Breadth-First for Web Page Ordering"
(http://citeseer.ist.psu.edu/730674.html).

This paper compares various strategies for prioritizing a crawl of
unfetched pages. Among others, it compares OPIC scoring with a simple
strategy called "large sites first", which prioritizes pages from large
sites and deprioritizes pages from small and medium sites. To measure
effectiveness, the authors used the value of accumulated PageRank vs.
the percentage of crawled pages: the strategy that ensures the quickest
ramp-up of aggregate PageRank is the best.
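
To make the ordering concrete, here is a small Python sketch of a
"large sites first" sort. It assumes that site size is measured by the
number of known unfetched pages per host, which may differ from the
paper's exact definition; the function name is hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def large_sites_first(unfetched_urls):
    """Order URLs so that hosts with the most pending pages come first."""
    pending = Counter(urlsplit(u).hostname for u in unfetched_urls)
    # sorted() is stable, so URLs keep their input order within a host
    # and between hosts that have the same number of pending pages.
    return sorted(unfetched_urls,
                  key=lambda u: -pending[urlsplit(u).hostname])
```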

A bit surprisingly, they found that large-sites-first wins over OPIC:

"Breadth-first is close to the best strategies for the first 20-30% of 
pages, but after that it becomes less efficient.
  The strategies batch-pagerank, larger-sites-first and OPIC have better 
performance than the other strategies, with an advantage towards 
larger-sites-first when the desired coverage is high. These strategies 
can retrieve about half of the Pagerank value of their domains 
downloading only around 20-30% of the pages."

Nutch currently uses OPIC-like scoring for this, so most likely it 
suffers from the same symptoms (the authors also mention a relatively 
poor OPIC performance at the beginning of a crawl).

Nutch doesn't currently collect any host-level statistics, so we
couldn't use the other strategy even if we wanted to.

What if we added a host-level DB to Nutch? Arguments against: it's an
additional data structure to maintain, which adds complexity to the
system, and it's an additional step in the workflow (-> it takes longer
to complete one cycle of crawling). Arguments for: we could implement
the above scoring method ;), plus host-level statistics are good for
detecting spam sites, limiting the crawl by site size, etc.

We could start by implementing a tool to collect such statistics from 
CrawlDb - this should be a trivial map-reduce job, so if anyone wants to 
take a crack at this it would be a good exercise ... ;)
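
The shape of such a job, sketched in plain Python rather than against
the Hadoop API: the map step emits (host, status) for each entry and the
reduce step counts statuses per host. CrawlDb entries are simplified
here to (url, status) pairs, and the function names are hypothetical.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit


def map_entry(url, status):
    """Map: emit the host as key and the page status as value."""
    yield urlsplit(url).hostname, status


def reduce_host(host, statuses):
    """Reduce: count pages per status for one host, plus a total."""
    counts = Counter(statuses)
    return host, {"total": sum(counts.values()), **counts}


def run_job(entries):
    """Tiny in-memory stand-in for the shuffle between map and reduce."""
    grouped = defaultdict(list)
    for url, status in entries:
        for host, value in map_entry(url, status):
            grouped[host].append(value)
    return dict(reduce_host(h, vs) for h, vs in grouped.items())
```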

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
