Hello all!

Lately I've been interested in understanding how Lucene scores my
documents, and I've already asked a couple of questions on the mailing
list. I was told to read the Similarity class documentation, which should
have all the info I want. Indeed, it mentions all of the elements I need
for my research. However, I keep failing to get their values for each of
the hits returned by my query.

The elements I want are the TF and the IDF. I want to use them to calculate
my own score, since I don't want the hassle of changing the whole core of
the scoring machinery. However, as I understand it, Lucene's notion of
TF-IDF isn't mine, so I'm left with "translating" its values into my
values. As for the IDF, I've managed to squeeze the elements I want out of
the maxDoc and docFreq methods, and in two lines I have my IDF value. The
TF, however, is a little more complex. I can't get the TF value for each
document, and I can't get anywhere near the values needed to calculate it.
From the explain method I can see that Lucene HAS to calculate them
somehow, but I don't know where or how to look.
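
To make the "translation" concrete, here is a minimal sketch in plain Java
(no Lucene calls). It assumes the classic default similarity, where
tf = sqrt(freq) and idf = 1 + ln(numDocs / (docFreq + 1)), and it assumes
that the "own" definitions are the textbook ones (raw count for TF,
ln(numDocs / docFreq) for IDF). The class name, the method names and the
numbers in main are made up for illustration; a custom Similarity would
need different formulas.

```java
public class TfIdfSketch {

    // Classic default tf(): square root of the raw within-document frequency.
    static double classicTf(double rawFreq) {
        return Math.sqrt(rawFreq);
    }

    // Classic default idf(): 1 + ln(numDocs / (docFreq + 1)).
    static double classicIdf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // Undo the square root to recover the raw count from a tf value,
    // for example one read off the explain() output.
    static double rawFreqFromClassicTf(double tf) {
        return tf * tf;
    }

    // Assumed "own" idf (textbook form, not taken from the post):
    // ln(numDocs / docFreq).
    static double textbookIdf(int docFreq, int numDocs) {
        return Math.log((double) numDocs / docFreq);
    }

    public static void main(String[] args) {
        int numDocs = 1000; // made-up value, as it might come from maxDoc
        int docFreq = 9;    // made-up value, as it might come from docFreq

        double tfSeenInExplain = classicTf(4); // stand-in for a value from explain()
        double rawFreq = rawFreqFromClassicTf(tfSeenInExplain);

        System.out.println("raw term frequency: " + rawFreq);
        System.out.println("classic idf:        " + classicIdf(docFreq, numDocs));
        System.out.println("textbook idf:       " + textbookIdf(docFreq, numDocs));
        System.out.println("own tf-idf score:   " + rawFreq * textbookIdf(docFreq, numDocs));
    }
}
```

Under these assumptions, squaring the tf value shown by explain gives back
the raw count, so nothing in the scoring core has to change.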

If someone can give me a hand with this, I'd be glad :) I've read pretty much
all the documentation concerning the tf() method (which led me nowhere), as
well as the documentation of other methods, so this is kind of my last
resort. The option I'm considering is to store the TermFreqVectors and work
my way from them.
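
A stored term vector is essentially a list of a document's terms with their
per-document counts. The sketch below is a plain-Java stand-in for that
idea, not the Lucene API: it counts raw term frequencies from an
already-tokenized document. The class name, method name and sample tokens
are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class TermCountSketch {

    // Count how often each term occurs in one document's token list.
    static Map<String, Integer> termFrequencies(String[] tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum); // add 1, starting from 1 if absent
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up, already-analyzed tokens of a single document.
        String[] tokens = {"lucene", "scores", "documents", "lucene"};
        Map<String, Integer> tf = termFrequencies(tokens);
        System.out.println("tf(lucene) = " + tf.get("lucene"));
        System.out.println("tf(scores) = " + tf.get("scores"));
    }
}
```

These raw counts are what a textbook TF formula would start from, without
going through the tf() method of the Similarity class.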

João Rodrigues
