That's an interesting point that helps to analyze the situation better. It seems to me the units are arbitrary, so the distance in this case is not very meaningful. I don't believe Lucene actually uses the full document vector. It uses the orthogonal projection of the document vector onto the subspace spanned by the query terms, since it only considers document vector terms corresponding to query vector terms. The distance from the tip of the projected document vector to the hyperplane orthogonal to the query vector (within the query subspace) does not seem that meaningful, even if the units were clear and natural. Document vectors at different angles to the query vector, and at arbitrarily large distances from it, can have the same distance to this plane.
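To make the geometry concrete, here is a minimal sketch in plain Python (not Lucene code). It assumes the simplified picture from Christoph's message: the score is the dot product of query and document divided by the query length only, with the coord factor left out. The function names and the term weights are made up for illustration, and the documents are already restricted to the two query terms.

```python
import math


def dot(u, v):
    """Dot product of two equal-length term-weight vectors."""
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    """Euclidean length of a vector."""
    return math.sqrt(dot(u, u))


def projection_score(query, doc):
    """Dot product normalized by the query length only.

    Geometrically: the distance from the tip of `doc` to the hyperplane
    through the origin that is orthogonal to `query`.
    """
    return dot(query, doc) / norm(query)


def cosine(query, doc):
    """Cosine similarity: also normalizes by the document length."""
    return dot(query, doc) / (norm(query) * norm(doc))


# Hypothetical weights for a two-term query; each document vector holds
# only the weights of those two terms (the projected document vector).
query = [1.0, 1.0]
docs = {
    "parallel": [2.0, 2.0],    # same direction as the query
    "one_term": [4.0, 0.0],    # matches only the first query term
    "lopsided": [1.0, 3.0],    # matches both terms, unevenly
}

for name, doc in docs.items():
    print(name, projection_score(query, doc), cosine(query, doc))
```

All three documents lie on the same plane parallel to the hyperplane, so they get the same projection score, while their cosine values differ because they point at different angles to the query.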
From a practical standpoint, I still think it is important to have meaningful normalized final scores so that applications can interpret these scores, for example to present results to users in a manner that depends on the relevance of the individual results. This seems easy to do in a natural way along the lines of my last proposal (boost-weighted normalization, possibly including some other factors). I've been busy on other aspects of my project, but still hope to get back to this and contribute a proposed improved scoring scheme.

Chuck

> -----Original Message-----
> From: Christoph Goller [mailto:[EMAIL PROTECTED]
> Sent: Sunday, October 31, 2004 8:00 AM
> To: Lucene Developers List
> Subject: About Hit Scoring
>
> I looked at the scoring mechanism more closely again. Some of you may
> remember that there was a discussion about this recently. There was
> especially some argument about the theoretical justification of
> the current scoring algorithm. Chuck proposed that at least from
> a theoretical perspective it would be good to apply a normalization
> on the document vector and thus implement the cosine similarity.
>
> Well, we found out that this cannot be implemented efficiently.
> However, I now found out that the current algorithm has a very
> intuitive theoretical justification. Some of you may already know
> that, but I never looked into it that deeply.
>
> Both the query and all documents are represented as vectors in term
> vector space. The current scoring is simply the dot product of the
> query with a document normalized by the length of the query vector
> (if we skip the additional coord factor). Geometrically speaking this
> is the distance of the document vector from the hyperplane through
> the origin which is orthogonal to the query vector. See attached
> figure.
>
> Christoph