Christoph, I'd like to spend more time looking at this, but won't be able to until tomorrow. It's a little confusing because the explain() output is not consistent with the actual score() values: an additional normalization is applied to bring score() values into [0,1], and explain() does not show it.
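To make the gap concrete, here is a minimal Python sketch (not Lucene code). It assumes the extra normalization rescales every raw score by the top raw score, and only when that top score exceeds 1; both points are assumptions to be checked against the source. `normalize_hits` and the sample numbers are hypothetical.

```python
def normalize_hits(raw_scores):
    """Rescale non-negative raw scores into [0, 1] using the top score.

    Assumption: the rescaling is applied only when the top raw score
    exceeds 1.0; otherwise the raw scores are reported unchanged.
    """
    if not raw_scores:
        return []
    top = max(raw_scores)
    scale = 1.0 / top if top > 1.0 else 1.0
    return [s * scale for s in raw_scores]


# Hypothetical raw scores, i.e. the values explain() would account for.
raw = [4.0, 2.0, 1.0]
reported = normalize_hits(raw)
print(reported)  # [1.0, 0.5, 0.25]
```

Under this assumption the ranking is unchanged, but the reported numbers no longer match what explain() describes.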
Looking at the code, I think the cancellations you've made below obscure the fact that idf is also squared in single-Term scores, not just when Terms occur in BooleanQueries. At least that is consistent. The inconsistency in the formulas troubled me when I first looked at it, but it turns out not to matter (so even if I'm wrong about the single-Term formula, it doesn't matter). That's because idf is irrelevant in a single-term query: it is a constant multiplier in all results and is simply normalized out.

I think there are at least two bugs here:
  1. idf should not be squared.
  2. explain() should explain the actual reported score().

I would venture a guess that these bugs are historical artifacts. Does anybody know whether normalization was introduced into the code after the original scoring mechanisms were written? The idf values need to be considered for normalization to work properly, which could have led to the inadvertent squaring.

Chuck

> -----Original Message-----
> From: Christoph Goller [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, October 13, 2004 2:04 AM
> To: Lucene Developers List
> Subject: Search and Scoring
>
> > As an aside, is there a reason that idf is squared in each Term and
> > Phrase match (it is multiplied both into the query component and the
> > field component)? To compensate for this, I'm taking the square root of
> > the idf I really want in my Similarity, which seems strange.
>
> Hi Chuck,
>
> that's a very good question. And you are right, it may be a bug; I am
> not sure about it. I stumbled over this several times when studying
> code in the search package. It's a little bit difficult to explain since
> the code for score computation is distributed over Weight and Scorer
> classes. It seems that a TermQuery or PhraseQuery weight is
> multiplied by idf twice, first in sumOfSquaredWeights() and then in
> normalize(). That's what you discovered.
>
> The formula in Similarity Javadoc does not describe the scoring completely.
> I'll try to write down the formula that exactly describes the current
> implementation. Then we can start a discussion and people can decide
> whether this is the intended scoring. (I assume DefaultSimilarity here.)
>
> Let's start with the simple case. A pure TermQuery (one-word query) gets
> the following score after cancelling down queryNorm(t) and queryBoost(t)
> (coord is 1 here):
>
> t: TermQuery
> d: document
>
> score(t, d) =
>   tf(t in d) * idf(t) * fieldBoost(t.field in d)
>   * lengthFieldNorm(t.field in d)
>
> Note that fieldBoost and lengthFieldNorm are both combined in norms.
>
> For a BooleanQuery consisting of several TermQueries we get the following
> (again, we can cancel down queryBoost(q)):
>
> q: BooleanQuery
> t: Term and corresponding TermQuery
> d: document
>
> score(q, d) = coord(q, d) * queryNorm(q) *
>   SUM_{t in q} ( tf(t in d) * idf(t)^2 * queryBoost(t)
>                  * fieldBoost(t.field in d)
>                  * lengthFieldNorm(t.field in d) )
>
> where
>   coord(q, d) = "fraction of TermQueries occurring in d"
>   queryNorm(q) = 1 / SQRT( SUM_{t in q} ( (idf(t) * queryBoost(t))^2 ) )
>
> I hope this starts a discussion.
>
> Christoph
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
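
For anyone who wants to experiment with the numbers, below is a minimal Python sketch (not Lucene code) of the BooleanQuery formula as quoted above. The function names, the dict layout, and the sample values are hypothetical. tf is passed in as an already-computed value, and norm stands for fieldBoost * lengthFieldNorm.

```python
import math


def query_norm(terms):
    """queryNorm(q) = 1 / sqrt(sum over t of (idf(t) * queryBoost(t))^2)."""
    return 1.0 / math.sqrt(sum((t["idf"] * t["boost"]) ** 2 for t in terms))


def boolean_score(terms, doc):
    """score(q, d) following the quoted BooleanQuery formula.

    terms: list of dicts with keys "name", "idf", "boost" (query side).
    doc:   dict mapping term name -> (tf, norm); terms absent from the
           dict are treated as not occurring in the document.
    """
    matched = [t for t in terms if t["name"] in doc]
    coord = len(matched) / len(terms)  # fraction of query terms in d
    total = 0.0
    for t in matched:
        tf, norm = doc[t["name"]]
        total += tf * t["idf"] ** 2 * t["boost"] * norm
    return coord * query_norm(terms) * total


# Hypothetical single-term query: queryNorm is 1 / (idf * boost), so one
# factor of idf and the boost cancel, leaving tf * idf * norm.
single = [{"name": "lucene", "idf": 2.0, "boost": 3.0}]
doc = {"lucene": (1.5, 0.5)}
assert math.isclose(boolean_score(single, doc), 1.5 * 2.0 * 0.5)
```

With a single term, the same idf multiplies every document's score, so the ratio between any two documents does not depend on it. Any rescaling by the top score therefore removes it, whatever power of idf the raw score contains.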