Have a look at DefaultSimilarity.java. I think the math works out once you follow the formulas described at http://lucene.apache.org/java/2_3_1/scoring.html

HTH, Grant

On Mar 10, 2008, at 6:05 PM, João Rodrigues wrote:

Hello all!

A few days ago I asked here whether I could get a "raw" tf-idf score out of Lucene's methods, and I was kindly advised to hack my way through the "explain" method. I have, but I can't make sense of the information it reports. Here's a printout from search.explain. My comments and doubts are inline, in bold:


Lucene Score: 1.000000
Explanation:

1.9983159 = (MATCH) weight(contents:chaperone in 73615), product of:
  0.99999994 = queryWeight(contents:chaperone), product of:
    7.3838615 = idf(docFreq=137, numDocs=81725) *-> I calculated this as 2.7756 or 6.3911 (if using Log or Ln)*

From DefaultSimilarity.java:

public float idf(int docFreq, int numDocs) {
    return (float)(Math.log(numDocs/(double)(docFreq+1)) + 1.0);
}

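ln(81725/(137+1)) + 1 = 6.38 + 1 = 7.38

Your 6.3911 is ln(81725/137): it leaves out the +1 on docFreq and the 1.0 that is added at the end.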
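If you want to compare the candidate values side by side, here is a minimal sketch. The class name IdfCheck is just something I made up for illustration; the idf method copies the formula quoted above, and the other two lines print the plain log10 and ln ratios without the extra terms.

```java
public class IdfCheck {
    // Same formula as the DefaultSimilarity.idf snippet quoted above.
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        int docFreq = 137;
        int numDocs = 81725;

        // Textbook-style ratios, with no +1 on docFreq and no +1.0 at the end.
        System.out.println("log10(N/df)      = " + Math.log10(numDocs / (double) docFreq));
        System.out.println("ln(N/df)         = " + Math.log(numDocs / (double) docFreq));

        // The variant that appears in the explain output.
        System.out.println("ln(N/(df+1)) + 1 = " + idf(docFreq, numDocs));
    }
}
```

The last line should land on the 7.38 reported by explain, up to float rounding.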



    0.13543049 = queryNorm
  1.998316 = (MATCH) fieldWeight(contents:chaperone in 73615), product of:
    1.7320508 = tf(termFreq(contents:chaperone)=3) *-> The doc has 32 tokens (according to Luke) and 3/32 != 1.7320508*

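DefaultSimilarity computes tf as the square root of the term frequency, not as termFreq divided by the document length:

3^0.5 = 1.73...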
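A small sketch of where the document length goes instead. TfCheck is a made-up class name; tf and lengthNorm follow DefaultSimilarity as I remember it, except that I dropped the fieldName argument from lengthNorm to keep the example short. The length of the field is applied through the fieldNorm line rather than through tf. As far as I know the norm is stored in a single byte per document, so the value explain prints (0.15625) is a lossy version of 1/sqrt(32); please treat that part as an assumption and check it against the source.

```java
public class TfCheck {
    // DefaultSimilarity-style tf: square root of the raw term frequency.
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // DefaultSimilarity-style length normalization: 1 / sqrt(number of terms in the field).
    // The real method also takes a field name, omitted here.
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        System.out.println("tf(3)          = " + tf(3));          // the tf line in the explanation
        System.out.println("3/32           = " + 3 / 32.0);       // the value you expected
        System.out.println("lengthNorm(32) = " + lengthNorm(32)); // before it is packed into one byte
    }
}
```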


    7.3838615 = idf(docFreq=137, numDocs=81725)
    0.15625 = fieldNorm(field=contents, doc=73615)
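To see how the pieces multiply together, here is a rough sketch that only replays the numbers printed in the explanation; it does not call Lucene at all. ExplainCheck and its method names are invented for illustration. The tf * idf line is probably the closest thing to the "raw" value you were after, since it leaves out queryNorm and fieldNorm.

```java
public class ExplainCheck {
    // queryWeight = idf * queryNorm, as listed under "queryWeight ... product of".
    static double queryWeight(double idf, double queryNorm) {
        return idf * queryNorm;
    }

    // fieldWeight = tf * idf * fieldNorm, as listed under "fieldWeight ... product of".
    static double fieldWeight(double tf, double idf, double fieldNorm) {
        return tf * idf * fieldNorm;
    }

    public static void main(String[] args) {
        // Values copied from the explain output.
        double tf = 1.7320508;
        double idf = 7.3838615;
        double queryNorm = 0.13543049;
        double fieldNorm = 0.15625;

        double qw = queryWeight(idf, queryNorm);
        double fw = fieldWeight(tf, idf, fieldNorm);

        System.out.println("queryWeight = " + qw);
        System.out.println("fieldWeight = " + fw);
        System.out.println("weight      = " + qw * fw);
        System.out.println("tf * idf    = " + tf * idf); // no normalization applied
    }
}
```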

---------------------------------------------------------------------------


So, what am I missing? I took the regular tf-idf formula from Wikipedia and checked it against a few textbooks, so I'm pretty sure it is right. I didn't set any boost factor or anything (otherwise it would also appear here, I suppose). I am using the StandardAnalyzer, which might account for a somewhat higher tf, but not one that large.

--------------------------
Grant Ingersoll
http://www.lucenebootcamp.com
Next Training: April 7, 2008 at ApacheCon Europe in Amsterdam

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ




