score calculation

POIRIER David Fri, 06 Jun 2008 08:41:38 -0700

Hello,


My index is, I guess, a little unorthodox: out of 75 000 entries, 55
000 come from a single source and the remaining 20 000 from 15 other
sources. When executing a query, though, I would like all the sources
to start on an equal footing, and that doesn't seem to be the case:
the documents coming from the "big" source are always last in the
result lists.

 

I then checked how the score is calculated:

 

score(q,d) = coord(q,d) . queryNorm(q) . SUM( tf(t in d) . idf(t)^2 .
t.getBoost() . norm(t, d) )

where q is a query, t is a term and d a document
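
The formula above can be sketched in a few lines of Python to see how a
small norm drags down the whole product. This is an illustration only: the
function name score, the TermStats container and all numeric inputs are
made up for the example; nothing here calls into Lucene or reads the index.

```python
from typing import NamedTuple, Sequence


class TermStats(NamedTuple):
    """Per-term factors of the formula (hypothetical container)."""
    tf: float      # tf(t in d)
    idf: float     # idf(t)
    boost: float   # t.getBoost()
    norm: float    # norm(t, d)


def score(coord: float, query_norm: float, terms: Sequence[TermStats]) -> float:
    """coord(q,d) * queryNorm(q) * SUM(tf * idf^2 * boost * norm)."""
    total = sum(t.tf * t.idf ** 2 * t.boost * t.norm for t in terms)
    return coord * query_norm * total


# Same made-up term statistics, differing only in norm(t, d).
small_source = score(1.0, 1.0, [TermStats(tf=1.0, idf=2.0, boost=1.0, norm=0.0024)])
big_source = score(1.0, 1.0, [TermStats(tf=1.0, idf=2.0, boost=1.0, norm=0.0001)])
print(small_source / big_source)
```

Because every factor is multiplied, two documents that match a term
equally well are ranked purely by the ratio of their norms.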

 

 

I won't go into all the details; for those, see the following link:
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/javadoc/org/apache/lucene/search/Similarity.html
Using the excellent Luke application, written by Andrzej Bialecki, I
found that my problem is due to the last element of the equation:
norm(t, d).

 

Using Luke, I checked the detailed explanation of a document after
executing a simple query on the index, something like
content:"anExpression". The score was equal to tf( ) * idf( ) *
fieldNorm( ), and there fieldNorm = 0.0001, completely killing the
score. This fieldNorm( ) value is, I guess, equal to the t.getBoost( ) .
norm( ) we see in the main equation? When checking the details of the
document fields for all the documents in Luke, I saw that the
boost field was amazingly low at 0.00419... Why is that? What is this
boost value exactly? The document boost?

 

Note: According to the Lucene documentation, norm( ) encapsulates the
document boost, the field boost and the lengthNorm( ) value. This
lengthNorm( ) value is computed when the document is added to the index,
based on the number of tokens of this field in the document. It should,
in theory, have no impact here.
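
As a rough sketch of the note above, the norm can be written as the
product of the three factors. The lengthNorm used here, 1/sqrt(number of
tokens), is what Lucene's DefaultSimilarity computes; whether this index
uses the default similarity is an assumption, and the function names and
sample values are hypothetical.

```python
import math


def length_norm(num_tokens: int) -> float:
    """1 / sqrt(numTokens), as in DefaultSimilarity (assumed to be in use)."""
    return 1.0 / math.sqrt(num_tokens)


def norm(doc_boost: float, field_boost: float, num_tokens: int) -> float:
    """norm(t, d) = document boost * field boost * lengthNorm(field)."""
    return doc_boost * field_boost * length_norm(num_tokens)


# Hypothetical 1000-token field: with boosts of 1.0 the norm is about 0.03;
# a document boost of 0.004 scales it down by the same factor.
print(norm(1.0, 1.0, 1000))
print(norm(0.004, 1.0, 1000))
```

Under these assumptions, a very low fieldNorm( ) on a field of ordinary
length would point to a boost rather than to lengthNorm( ).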

 

I went a little bit further, creating 3 indexes for 3 different sources
(one index per source). I got the following metrics in Luke:

 

 

Index size    Boost value (all documents)    fieldNorm( ) (all documents)
190           0.08222                        0.0024
789           0.067                          0.0020
6838          0.012                          0.0004
50 000        0.004193                       0.0001

 

As we can see, there is a clear correlation between the size of the index
and the boost value associated with it. Note that within a single source,
the size and URL patterns of all the referenced documents are alike.
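
One quick check on the figures reported above is to divide each
fieldNorm( ) value by the corresponding boost. If fieldNorm( ) is the boost
multiplied by a length factor, the ratio should be roughly the same for
every index. The sketch below uses only the numbers quoted from Luke; the
variable names are illustrative, and the displayed values are rounded, so
the ratios are approximate.

```python
# (index size, boost, fieldNorm) as reported by Luke for each index
metrics = [
    (190, 0.08222, 0.0024),
    (789, 0.067, 0.0020),
    (6838, 0.012, 0.0004),
    (50000, 0.004193, 0.0001),
]

# fieldNorm / boost: what is left once the boost is factored out
ratios = [field_norm / boost for _, boost, field_norm in metrics]
for (size, boost, field_norm), ratio in zip(metrics, ratios):
    print(f"size={size:>6}  boost={boost:<8}  fieldNorm={field_norm:<7}  ratio={ratio:.4f}")
```

The ratios stay between roughly 0.02 and 0.035 while the boost varies by
a factor of about 20, which is consistent with fieldNorm( ) being the boost
times a roughly constant length factor.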

 

There is indeed something strange. If we agree that the fieldNorm( )
value is a function of the boost value, then I have a problem with the
boost. The boost value displayed in Luke should be the same for every
index! Unless the document boost or the field boost is somehow linked to
the size of the index... or the size of a segment? Or the URL pattern?

 

Basically my question is: How is the boost value displayed in Luke as a
field for every indexed document calculated?

 

Thank you and good week-end,

 

David
