RE: score calculation

POIRIER David Mon, 09 Jun 2008 03:47:50 -0700

Matt,

There is one key thing that I have just understood: how Nutch calculates
the boost value stored in the boost metadata field of every document
inside your index. It is done by the scoring-opic plugin.


As you probably already know, Nutch is being developed using a
plugin-based architecture. By creating new plugins you can modify
Nutch's behavior when crawling a source. Common modifications involve
parsing plugins and query plugins; more details on that are available on
the wiki.

But what I found this morning is that the scoring-opic plugin is not
called when a search is executed on an index, as I thought, but when the
fetched documents are being parsed. The score calculated by this plugin
is, I might be mistaken but I don't think so, the document boost value
and not the actual score of a document for a given query. I will call
the latter docScoreRelatedToAQuery.

As you now know, the docScoreRelatedToAQuery formula is:
 
docScoreRelatedToAQuery(q, d) = coord(q, d) . queryNorm(q) .
SUM( tf(t in d) . idf(t)^2 . t.getBoost() . norm(t, d) )

Where

norm(t, d) = doc.getBoost() . lengthNorm(field) . f.getBoost()

If we take for granted that every generated field has a default boost
value of 1.0, the equation becomes:

norm(t, d) = doc.getBoost() . lengthNorm(field)
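
To make the role of doc.getBoost() concrete, here is a minimal Java
sketch that evaluates the two formulas above with plain numbers. It is
not Nutch or Lucene code: the class name ScoreFormulaSketch, the
TermStats holder and all the input values are hypothetical, and the
field boost is fixed at 1.0 as assumed above.

```java
public class ScoreFormulaSketch {

    /** Per-term inputs for one document; every value used here is hypothetical. */
    static final class TermStats {
        final double tf;          // tf(t in d)
        final double idf;         // idf(t)
        final double termBoost;   // t.getBoost()
        final double lengthNorm;  // lengthNorm(field)

        TermStats(double tf, double idf, double termBoost, double lengthNorm) {
            this.tf = tf;
            this.idf = idf;
            this.termBoost = termBoost;
            this.lengthNorm = lengthNorm;
        }
    }

    /** norm(t, d) = doc.getBoost() . lengthNorm(field) . f.getBoost() */
    static double norm(double docBoost, double lengthNorm, double fieldBoost) {
        return docBoost * lengthNorm * fieldBoost;
    }

    /** coord . queryNorm . SUM( tf . idf^2 . t.getBoost() . norm ), field boost assumed 1.0 */
    static double score(double coord, double queryNorm, double docBoost, TermStats[] terms) {
        double sum = 0.0;
        for (TermStats t : terms) {
            sum += t.tf * t.idf * t.idf * t.termBoost * norm(docBoost, t.lengthNorm, 1.0);
        }
        return coord * queryNorm * sum;
    }

    public static void main(String[] args) {
        TermStats[] terms = { new TermStats(2.0, 1.5, 1.0, 0.5) };
        // Same document and query, two different document boosts.
        System.out.println("docBoost 1.0   -> " + score(1.0, 1.0, 1.0, terms));
        System.out.println("docBoost 0.001 -> " + score(1.0, 1.0, 0.001, terms));
    }
}
```

Since doc.getBoost() multiplies every term of the sum, a boost that is a
thousand times smaller gives a score that is a thousand times smaller,
whatever the query is.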

And this is where a problem arose for me. While the lengthNorm(field)
value was OK, doc.getBoost() was returning astronomically low values
for certain indexes.

Looking at the source code of the scoring-opic plugin I understood that
this value was directly linked to the number of outlinks that the parent
document has. Basically, if your seed page is a generated page with
thousands of links toward child documents, those children will see their
doc.getBoost() value be extremely low and this will directly impact
their position inside a result set.
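
A rough illustration of that effect, in Java. This is a simplified
sketch and not the actual OPICScoringFilter code: it only assumes that
the parent's score is divided evenly over its outlinks, and the class
name OutlinkShareSketch, the method perChildScore and the numbers are
all hypothetical.

```java
public class OutlinkShareSketch {

    /**
     * Share of a parent's score handed to each child when the score is
     * divided evenly over the parent's outlinks (simplifying assumption).
     */
    static double perChildScore(double parentScore, int outlinkCount) {
        if (outlinkCount <= 0) {
            throw new IllegalArgumentException("outlinkCount must be positive");
        }
        return parentScore / outlinkCount;
    }

    public static void main(String[] args) {
        double seedScore = 1.0; // hypothetical score of the seed page
        for (int outlinks : new int[] {10, 1000, 50000}) {
            System.out.println(outlinks + " outlinks -> child score "
                    + perChildScore(seedScore, outlinks));
        }
    }
}
```

The more links the generated seed page carries, the smaller the value
each child starts with, which matches the very low doc.getBoost()
values I was seeing.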

I modified the plugin, eliminating the unwanted division, and I now have
a doc.getBoost() value of 1.0 for every document. In my case, the norm
equation is now:

norm(t, d) = lengthNorm(field)

Note: the Java class involved is
org.apache.nutch.scoring.opic.OPICScoringFilter, found under the
${NUTCH_HOME}/src/plugins/scoring-opic directory.

Things are good again, and I will be able to enjoy tonight's Italy vs
Netherlands game :-)

Hope this helps,


David


-----Original Message-----
From: vanderkerkoff [mailto:[EMAIL PROTECTED] 
Sent: Monday, 9 June 2008 12:09
To: [email protected]
Subject: Re: score calculation


Hi David

I'm trying to understand the same thing, how the scoring works when you
have more than one site being indexed.

I'll keep you informed if I work anything out, although I'm a bit behind
you for certain.

I'm playing catch up :-)

Matt
-- 
View this message in context:
http://www.nabble.com/score-calculation-tp17695314p17729867.html
Sent from the Nutch - User mailing list archive at Nabble.com.
