Matt,
There is one key thing that I have just understood: the boost value
stored in the boost metadata field of every document in your index is
calculated by the scoring-opic plugin.
As you probably already know, Nutch is built on a plugin-based
architecture. By creating new plugins you can modify Nutch's behavior
when crawling a source. Common modifications involve parsing plugins
and query plugins; more details are available on the wiki.
But what I found this morning is that the scoring-opic plugin is not
called when a search is executed on an index, as I thought, but when the
fetched documents are being parsed. The score calculated by this plugin
is (I might be mistaken, but I don't think so) the document boost value,
not the actual score of a document for a given query. I will call the
latter docScoreRelatedToAQuery.
As you now know, the docScoreRelatedToAQuery formula is:
docScoreRelatedToAQuery(q, d) = coord(q, d) . queryNorm(q) .
SUM( tf(t in d) . idf(t)^2 . t.getBoost() . norm(t, d) )
Where
norm(t, d) = doc.getBoost() . lengthNorm(field) . f.getBoost()
If we take for granted that every generated field has a default boost
value of 1.0, the equation becomes:
norm(t, d) = doc.getBoost() . lengthNorm(field)
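
To make the effect of doc.getBoost() concrete, here is a minimal Python sketch of the two formulas above. It is only an illustration, not Lucene or Nutch code: the function names are made up, and length_norm assumes the classic Lucene default of 1 / sqrt(number of terms in the field).

```python
import math


def length_norm(num_terms):
    # Assumption: classic Lucene default, 1 / sqrt(number of terms in the field).
    return 1.0 / math.sqrt(num_terms)


def norm(doc_boost, num_terms, field_boost=1.0):
    # norm(t, d) = doc.getBoost() . lengthNorm(field) . f.getBoost()
    return doc_boost * length_norm(num_terms) * field_boost


def doc_score(coord, query_norm, terms):
    # terms: iterable of (tf, idf, term_boost, norm_value) tuples, one per query term.
    # Implements coord . queryNorm . SUM( tf . idf^2 . t.getBoost() . norm ).
    total = sum(tf * idf ** 2 * boost * n for tf, idf, boost, n in terms)
    return coord * query_norm * total


# Same 100-term field, once with a neutral boost and once with a tiny one.
normal = norm(doc_boost=1.0, num_terms=100)
tiny = norm(doc_boost=0.001, num_terms=100)
```

Because norm(t, d) multiplies every term of the sum, a document boost of 0.001 scales the whole query score down by the same factor, whatever the query is.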
And this is where a problem arises for me. While the lengthNorm(field)
value was OK, doc.getBoost() was returning astronomically low values
for certain indexes.
Looking at the source code of the scoring-opic plugin I understood that
this value is directly linked to the number of outlinks that the parent
document has. Basically, if your seed page is a generated page with
thousands of links toward child documents, those children will see their
doc.getBoost() value be extremely low, and this will directly impact
their position in a result set.
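
A simplified model of that dilution, in Python. This is not the plugin's source: the function name is hypothetical, and the real OPICScoringFilter may apply additional factors. It only illustrates the division by the number of outlinks described above.

```python
def share_per_outlink(parent_score, num_outlinks):
    # Assumption: the parent's score is split evenly among its outlinks.
    if num_outlinks <= 0:
        return 0.0
    return parent_score / num_outlinks


# A seed page with a score of 1.0, first with few links, then with thousands.
small_site = share_per_outlink(1.0, 10)
huge_index_page = share_per_outlink(1.0, 5000)
```

With 10 outlinks each child starts at 0.1; with 5000 outlinks each child starts at 0.0002, which is the kind of very low boost I was seeing.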
I modified the plugin, eliminating the unwanted division, and I now have
a doc.getBoost() value of 1.0 for every document. In my case, the norm
equation is now:
norm(t, d) = lengthNorm(field)
Note: the Java class involved is
org.apache.nutch.scoring.opic.OPICScoringFilter, found under the
${NUTCH_HOME}/src/plugins/scoring-opic directory.
And things are good again and I will be able to enjoy tonight's Italy vs
Netherlands game :-)
Hope this helps,
David
-----Original Message-----
From: vanderkerkoff [mailto:[EMAIL PROTECTED]
Sent: Monday, 9 June 2008 12:09
To: [email protected]
Subject: Re: score calculation
Hi David
I'm trying to understand the same thing, how the scoring works when you
have more than one site being indexed.
I'll keep you informed if I work anything out, although I'm a bit behind
you for certain.
I'm playing catch up :-)
Matt
--
View this message in context:
http://www.nabble.com/score-calculation-tp17695314p17729867.html
Sent from the Nutch - User mailing list archive at Nabble.com.