Have a look at DefaultSimilarity.java. I think the math works out once
you follow the formula; see http://lucene.apache.org/java/2_3_1/scoring.html
HTH, Grant
On Mar 10, 2008, at 6:05 PM, João Rodrigues wrote:
Hello all!
I asked here a few days ago whether I could get a "raw" tf-idf score
out of Lucene's methods. I was kindly advised to hack my way through the
"explain" method. I have, but I can't make sense of the information it
reports. Here's a printout from a search.explain call. My comments and
doubts are inline, in bold:
Lucene Score: 1.000000
Explanation:
1.9983159 = (MATCH) weight(contents:chaperone in 73615), product of:
0.99999994 = queryWeight(contents:chaperone), product of:
7.3838615 = idf(docFreq=137, numDocs=81725)
*-> I calculated this as 2.7756 or 6.3911 (if using Log or Ln)*
From DefaultSimilarity.java:
  public float idf(int docFreq, int numDocs) {
    return (float)(Math.log(numDocs/(double)(docFreq+1)) + 1.0);
  }
Math.log is the natural log, docFreq gets a +1 in the denominator, and
1.0 is added to the result:
ln(81725/(137+1)) + 1 = 6.38 + 1 = 7.38
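
To make the difference concrete, here is a small standalone sketch that
puts the three calculations side by side. The class name IdfCheck and the
two "textbook" helper methods are made up for illustration; only the body
of luceneIdf is copied from the DefaultSimilarity snippet above, and
nothing here calls Lucene itself.

```java
public class IdfCheck {

    // Same expression as the DefaultSimilarity.idf method quoted above.
    static float luceneIdf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // Hypothetical helper: textbook idf with the natural log, no smoothing.
    static double textbookIdfLn(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) docFreq);
    }

    // Hypothetical helper: textbook idf with the base-10 log, no smoothing.
    static double textbookIdfLog10(int docFreq, int numDocs) {
        return Math.log10(numDocs / (double) docFreq);
    }

    public static void main(String[] args) {
        int docFreq = 137;
        int numDocs = 81725;
        // Roughly 7.38, the value shown in the explain output.
        System.out.println("Lucene idf:       " + luceneIdf(docFreq, numDocs));
        // Roughly 6.39 and 2.78, the two hand-calculated values.
        System.out.println("textbook (ln):    " + textbookIdfLn(docFreq, numDocs));
        System.out.println("textbook (log10): " + textbookIdfLog10(docFreq, numDocs));
    }
}
```

So the hand-calculated 6.3911 was already close; the gap comes from the
+1 inside the log and the 1.0 added afterwards.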
0.13543049 = queryNorm
1.998316 = (MATCH) fieldWeight(contents:chaperone in 73615), product of:
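1.7320508 = tf(termFreq(contents:chaperone)=3)
*-> The doc has 32 tokens (according to luke) and 3/32 != 1.7320508*
DefaultSimilarity's tf is the square root of the term frequency, not
freq divided by the document length: 3^0.5 = 1.73...
Document length is accounted for separately, in fieldNorm.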
7.3838615 = idf(docFreq=137, numDocs=81725)
0.15625 = fieldNorm(field=contents, doc=73615)
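
Putting the pieces together, the sketch below rebuilds the top-level
1.9983159 from the numbers in this explain output. It is a standalone
re-implementation for checking the arithmetic, not Lucene code: the class
name ExplainCheck and the score method are made up, tf, idf and queryNorm
mirror the DefaultSimilarity formulas as I understand them, and fieldNorm
is taken verbatim from the output above rather than recomputed.

```java
public class ExplainCheck {

    // tf as in DefaultSimilarity: square root of the raw term frequency.
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // idf as in the DefaultSimilarity snippet quoted earlier in this mail.
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // queryNorm: 1 / sqrt(sum of squared term weights).
    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    // Hypothetical helper: single-term, unboosted query against one document.
    static float score(int termFreq, int docFreq, int numDocs, float fieldNorm) {
        float idf = idf(docFreq, numDocs);
        // With one term and no boost, the sum of squared weights is idf^2,
        // so queryWeight = idf * queryNorm comes out at about 1.0.
        float queryWeight = idf * queryNorm(idf * idf);
        float fieldWeight = tf(termFreq) * idf * fieldNorm;
        return queryWeight * fieldWeight;
    }

    public static void main(String[] args) {
        // termFreq=3, docFreq=137, numDocs=81725, fieldNorm=0.15625
        // are all read off the explain output; expect roughly 1.998.
        System.out.println(score(3, 137, 81725, 0.15625f));
    }
}
```

For a "raw" tf-idf you would presumably want just tf * idf from the
fieldWeight part, leaving out queryNorm and fieldNorm.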
---------------------------------------------------------------------------
So, what am I missing? I read the regular tf-idf rule on Wikipedia, along
with some textbooks I found, so I'm pretty sure my calculation is OK. I
didn't set any boost factor or anything (otherwise it would also appear
here, I suppose). I am using the StandardAnalyzer, which could account
for a somewhat higher tf, but not one that large.
--------------------------
Grant Ingersoll
http://www.lucenebootcamp.com