+1

I have read the paper about OPIC and it seems very good. I think it is a must for Nutch to have a good (and fast) webgraph-based ranking algorithm. I have fetched about 250 million pages, and what I see is that inlink count alone is not good enough for a big crawl and quality results.

Thanks,

Massimo

On 29 Sep 2005, at 23:38, Doug Cutting wrote:

Here's some interesting stuff about OPIC, an easy-to-calculate link-based measure of page quality. I'm going to read the papers, and if it is as good as it sounds, perhaps implement this in the mapred branch. Does anyone have experience with OPIC?

-------- Original Message --------
Subject: Fetch list priority
Date: Thu, 29 Sep 2005 10:57:31 +0200
From: Carlos Alberto-Alejandro CASTILLO-Ocaranza
Organization: Universitat Pompeu Fabra

Hi Doug, I'm ChaTo, developer of the WIRE crawler; we met in Compiegne
during the OSWIR workshop.

I told you I would contact you about the priorities of the crawler, and
that there were better strategies than using log(indegree). I suggested
using OPIC (online page importance computation).

OPIC is described here by Abiteboul et al.:

http://www.citeulike.org/user/ChaTo/article/240858

We did experiments with OPIC on two collections of 2 million pages each,
and we verified that these collections have the same power-law exponents
as the full web [I'm attaching a graph of Pagerank vs. pages
downloaded]. Ordering pages by indegree is as bad as random:

http://www.citeulike.org/user/ChaTo/article/240824

http://www.citeulike.org/user/ChaTo/article/240898

Why? Because the crawler tends to focus on a few Web sites. See, for
instance, Boldi et al., "Do your worst to make the best":

http://www.citeulike.org/user/ChaTo/article/240866

========================================================================

Here is the general idea of OPIC: at the beginning, each page has the
same score. Let's call it 'opic':

  for all initial pages i:
     opic[i] = 1;

Whenever you find a link:

  opic[destination] += opic[source] / outdegree[source];

This is it. Abiteboul's paper proves that this converges even in a
changing graph, and that it is a good estimator of quality. He also
suggests using the history of a page to keep its opic across crawls,
but even without the history we have seen that it works quite well.
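
In case it helps, here is a minimal Java sketch of that update rule,
assuming a simple in-memory Map from URL to score (the class and method
names are made up for illustration, they are not part of Nutch):

  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;

  public class OpicSketch {
    // opic score per URL; every page starts with the same score
    private final Map<String, Float> opic = new HashMap<String, Float>();

    /** Register a newly discovered page with the initial score of 1. */
    public void addPage(String url) {
      if (!opic.containsKey(url)) {
        opic.put(url, 1.0f);
      }
    }

    /** Whenever the links of a page are found, give each destination an
     *  equal share of the source: opic[dest] += opic[source]/outdegree. */
    public void processLinks(String source, List<String> destinations) {
      if (destinations.isEmpty()) return;
      Float src = opic.get(source);
      float share = (src == null ? 1.0f : src) / destinations.size();
      for (String dest : destinations) {
        Float current = opic.get(dest);
        opic.put(dest, (current == null ? 0.0f : current) + share);
      }
    }
  }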

In your case, what you do in org.apache.nutch.tools.FetchListTool is:
    ...
    String[] anchors = dbAnchors.getAnchors(page.getURL());
    curScore.set(scoreByLinkCount ?
      (float)Math.log(anchors.length+1) : page.getScore());
    ...

You need something different, because you will have to read the scores
of the pages that are pointing to your page. You can do that either by
(a) keeping or reading the scores of the inlinks to each page, or
(b) doing this cycle over the source pages in the other order:

   for each page P in the webdb:
     for each outlink of page P
       opic[destination] += opic[P] / outdegree[P];

Note that to make this more effective you must also update the 'opic' of the pages you have already crawled, and I think you should also avoid self-links.
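
A rough sketch of option (b) in Java, with the self-link check; the Page
type and its accessors here are placeholders, not the real webdb API:

  import java.util.Map;

  public class OpicPass {
    /** One pass over the webdb in source-page order: each page pushes an
     *  equal share of its score to every outlink destination, skipping
     *  links from a page to itself. */
    public void run(Iterable<Page> webdb, Map<String, Float> opic) {
      for (Page p : webdb) {
        String[] outlinks = p.getOutlinks();
        if (outlinks.length == 0) continue;          // nothing to distribute
        Float score = opic.get(p.getUrl());
        float share = (score == null ? 1.0f : score) / outlinks.length;
        for (String dest : outlinks) {
          if (dest.equals(p.getUrl())) continue;     // avoid self-links
          Float current = opic.get(dest);
          opic.put(dest, (current == null ? 0.0f : current) + share);
        }
      }
    }

    /** Placeholder for a webdb page record (hypothetical). */
    public interface Page {
      String getUrl();
      String[] getOutlinks();
    }
  }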

The 'opic' scores will also be statistically distributed according to a
power-law, so it is sensible to use log(opic) when combining this with
other scores that have a different distribution, such as text similarity.
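
Something along these lines, purely as an illustration (the 0.5/0.5
weights are arbitrary placeholders, not tuned values):

  public class ScoreCombiner {
    /** Blend a power-law-distributed opic score with a text-similarity
     *  score by taking the log of opic first. Illustrative only. */
    public static float combinedScore(float opic, float textSimilarity) {
      // +1 keeps the log finite for very small scores
      float linkPart = (float) Math.log(opic + 1.0f);
      return 0.5f * linkPart + 0.5f * textSimilarity;
    }
  }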

========================================================================

I hope this is useful for you.

All the best,

--
ChaTo    = Carlos Alberto-Alejandro CASTILLO-Ocaranza, PhD


