+1

I have read the paper about OPIC and it seems very good. I think it is a must for Nutch to have a good (and fast) webgraph-based ranking algorithm. I have fetched about 250 million pages, and what I see is that inlink count alone is not good enough for a big crawl and quality results.

Thanks,

Massimo

On 29 Sep 2005, at 23:38, Doug Cutting wrote:

Here's some interesting stuff about OPIC, an easy-to-calculate link-based measure of page quality. I'm going to read the papers, and if it is as good as it sounds, perhaps implement this in the mapred branch. Does anyone have experience with OPIC?

-------- Original Message --------
Subject: Fetch list priority
Date: Thu, 29 Sep 2005 10:57:31 +0200
From: Carlos Alberto-Alejandro CASTILLO-Ocaranza
Organization: Universitat Pompeu Fabra

Hi Doug, I'm ChaTo, developer of the WIRE crawler; we met in Compiegne
during the OSWIR workshop.

I told you I would contact you about the priorities of the crawler, and
that there were better strategies than using log(indegree). I suggested
using OPIC (online page importance computation).

OPIC is described here by Abiteboul et al.:

http://www.citeulike.org/user/ChaTo/article/240858

We did experiments with OPIC on two collections of 2 million pages each,
and we verified that these collections have the same power-law exponents
as the full web [I'm attaching a graph of Pagerank vs. pages
downloaded]. Ordering pages by indegree is as bad as random:

http://www.citeulike.org/user/ChaTo/article/240824

http://www.citeulike.org/user/ChaTo/article/240898

Why? Because the crawler tends to focus on a few Web sites. See, for
instance, Boldi et al., "Do your worst to make the best":

http://www.citeulike.org/user/ChaTo/article/240866

========================================================================

Here is the general idea of OPIC: at the beginning, each page has the
same score. Let's call it 'opic':

  for all initial pages i:
     opic[i] = 1;

Whenever you find a link:

  opic[destination] += opic[source] / outdegree[source];

This is it. Abiteboul's paper proves that this converges even in a
changing graph, and that it is a good estimator of quality. He also
suggests using the history of a page to keep its opic across crawls,
but even without the history we have seen that it works quite well.
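
In case it helps, here is a minimal Java sketch of that update rule,
assuming a simple in-memory Map from URL to score (the class and method
names are made up for illustration, they are not part of Nutch):

  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  public class OpicSketch {
    // opic score per URL; every page starts with the same score
    private final Map<String, Float> opic = new HashMap<String, Float>();

    /** Register a newly discovered page with the initial score of 1. */
    public void addPage(String url) {
      if (!opic.containsKey(url)) {
        opic.put(url, 1.0f);
      }
    }

    /** Whenever the links of a page are found, give each destination an
     *  equal share of the source: opic[dest] += opic[source]/outdegree. */
    public void processLinks(String source, List<String> destinations) {
      if (destinations.isEmpty()) return;
      Float src = opic.get(source);
      float share = (src == null ? 1.0f : src) / destinations.size();
      for (String dest : destinations) {
        Float current = opic.get(dest);
        opic.put(dest, (current == null ? 0.0f : current) + share);
      }
    }
  }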

In your case, what you do in org.apache.nutch.tools.FetchListTool is:
    ...
    String[] anchors = dbAnchors.getAnchors(page.getURL());
    curScore.set(scoreByLinkCount ?
      (float)Math.log(anchors.length+1) : page.getScore());
    ...

You need something different, because you will have to read the scores
of the pages that are pointing to your page. You can do that either by
(a) keeping or reading the scores of the inlinks to each page, or
(b) doing this cycle over the source pages in the other order:

   for each page P in the webdb:
     for each outlink of page P
       opic[destination] += opic[P] / outdegree[P];

Note that to make this more effective you must also update the 'opic' of the pages you have already crawled, and I think you should also avoid self-links.
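
A rough sketch of option (b) in Java, with the self-link check; the Page
type and its accessors here are placeholders, not the real webdb API:

  import java.util.Map;

  public class OpicPass {
    /** One pass over the webdb in source-page order: each page pushes an
     *  equal share of its score to every outlink destination, skipping
     *  links from a page to itself. */
    public void run(Iterable<Page> webdb, Map<String, Float> opic) {
      for (Page p : webdb) {
        String[] outlinks = p.getOutlinks();
        if (outlinks.length == 0) continue;          // nothing to distribute
        Float score = opic.get(p.getUrl());
        float share = (score == null ? 1.0f : score) / outlinks.length;
        for (String dest : outlinks) {
          if (dest.equals(p.getUrl())) continue;     // avoid self-links
          Float current = opic.get(dest);
          opic.put(dest, (current == null ? 0.0f : current) + share);
        }
      }
    }

    /** Placeholder for a webdb page record (hypothetical). */
    public interface Page {
      String getUrl();
      String[] getOutlinks();
    }
  }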

The 'opic' scores will also be statistically distributed according to a
power-law, so it is sensible to use log(opic) when combining this with
other scores that have a different distribution, such as text similarity.
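
Something along these lines, purely as an illustration (the 0.5/0.5
weights are arbitrary placeholders, not tuned values):

  public class ScoreCombiner {
    /** Blend a power-law-distributed opic score with a text-similarity
     *  score by taking the log of opic first. Illustrative only. */
    public static float combinedScore(float opic, float textSimilarity) {
      // +1 keeps the log finite for very small scores
      float linkPart = (float) Math.log(opic + 1.0f);
      return 0.5f * linkPart + 0.5f * textSimilarity;
    }
  }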

========================================================================

I hope this is useful for you.

All the best,

--
ChaTo    = Carlos Alberto-Alejandro CASTILLO-Ocaranza, PhD


