Hello all! Lately I've been interested in understanding how Lucene scores my documents, and I've already asked a couple of questions on the mailing list. I was pointed to the Similarity class documentation, which should have all the info I want. Indeed, it mentions all of the elements I need for my research. However, I keep failing to get their values for each of the hits returned by my query.
The elements I want are the TF and the IDF. I want to use them to calculate my own score, since I'd rather avoid the hassle of changing the whole core of the scoring machinery. However, as far as I understand, Lucene's notion of TF-IDF isn't mine, so I'm left with "translating" its values into my values. As for the IDF, I've managed to get the elements I need from the maxDoc() and docFreq() methods, and in two lines I have my IDF value.

TF, however, is a little more complex. I can't get the TF value for each document, and I can't get anywhere near the values needed to calculate it. From the explain() method I can see that Lucene has to calculate them somehow, but I don't know where or how to look. If someone can give me a hand with this, I'd be glad :) I've read pretty much all the documentation for the tf() method (which led me nowhere), as well as the documentation for other methods, so this is kind of my last resort. The option I'm considering is to store the TermFreqVectors and work my way from them.

João Rodrigues
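
To make the "translation" concrete, here is a minimal sketch of the kind of calculation described above. It is plain Java with no Lucene calls, and it assumes the textbook definitions (raw term count for TF, log(N/df) for IDF), which may not be the exact formulas intended. The class and method names (CustomTfIdf, termFrequencies, idf, tfIdf) are hypothetical. The maxDoc and docFreq arguments stand in for the numbers read from the index, and the token list stands in for the terms that a stored term vector would provide.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CustomTfIdf {

    // Raw term frequency: how many times each term occurs in one document.
    // Assumption: in a real setup these counts would come from a stored
    // term vector rather than from a token list like this one.
    static Map<String, Integer> termFrequencies(List<String> tokens) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : tokens) {
            tf.merge(token, 1, Integer::sum);
        }
        return tf;
    }

    // Textbook IDF: log(N / df). This is an assumed formula, not Lucene's.
    // Returns 0 when the term appears in no document, to avoid dividing by zero.
    static double idf(int maxDoc, int docFreq) {
        if (docFreq <= 0) {
            return 0.0;
        }
        return Math.log((double) maxDoc / docFreq);
    }

    // Custom score for one term in one document.
    static double tfIdf(int termFreq, int maxDoc, int docFreq) {
        return termFreq * idf(maxDoc, docFreq);
    }

    public static void main(String[] args) {
        // Hypothetical document and index statistics, for illustration only.
        List<String> doc = Arrays.asList("lucene", "scores", "lucene", "documents");
        int maxDoc = 1000;  // total number of documents (assumed)
        int docFreq = 10;   // documents containing "lucene" (assumed)

        int tf = termFrequencies(doc).getOrDefault("lucene", 0);
        System.out.println("tf = " + tf);
        System.out.println("tf-idf = " + tfIdf(tf, maxDoc, docFreq));
    }
}
```

Keeping the counting step separate from the weighting step means the raw frequency stays available, so any other TF variant (logarithmic, normalized by document length, and so on) can be swapped in without touching the IDF part.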
