Addendum: I forgot probably the most important point. The current normalization in Hits changes the final score so that it is no longer the distance to the query-orthogonal hyperplane. This normalization makes the final score both ambiguous and harder to interpret. It is ambiguous because the normalization may or may not be applied, depending on the fairly arbitrary condition of whether the top raw score is greater than 1.0. When the normalization is applied, a result's final score is the ratio of its distance from the query-orthogonal hyperplane to the largest such distance of any result, which doesn't seem particularly meaningful to me. At the least, there is no absolute interpretation for this score, in the sense that a specific number indicates a specific relevance, which is what I'm looking for.
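To make the conditional behavior concrete, here is a small sketch in Java. This is not the actual Hits source; the class and method names are mine, and it only illustrates the rule described above: divide every raw score by the top raw score, but only when the top raw score exceeds 1.0.

```java
// Illustrative sketch (not Lucene code) of the Hits score normalization.
public class HitsNormalizationSketch {

    // rawScores is assumed sorted descending, as Hits returns them.
    public static float[] normalize(float[] rawScores) {
        if (rawScores.length == 0) {
            return rawScores;
        }
        float top = rawScores[0];
        float[] out = new float[rawScores.length];
        for (int i = 0; i < rawScores.length; i++) {
            // The "fairly arbitrary condition": normalize only if top > 1.0.
            out[i] = top > 1.0f ? rawScores[i] / top : rawScores[i];
        }
        return out;
    }

    public static void main(String[] args) {
        // top > 1.0: every score becomes a ratio of the top score
        float[] a = normalize(new float[] { 2.0f, 1.0f, 0.5f });
        // top <= 1.0: raw scores pass through untouched
        float[] b = normalize(new float[] { 0.8f, 0.4f });
        System.out.println(java.util.Arrays.toString(a));
        System.out.println(java.util.Arrays.toString(b));
    }
}
```

Note how the same document can receive a different final score depending solely on what other documents happen to match, which is exactly why no absolute relevance can be read off the number.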
Chuck

> -----Original Message-----
> From: Chuck Williams [mailto:[EMAIL PROTECTED]
> Sent: Sunday, October 31, 2004 8:13 AM
> To: Lucene Developers List
> Subject: RE: About Hit Scoring
>
> That's an interesting point that helps to better analyze the situation.
> It seems to me the units are arbitrary, and so the distance in this case
> is not very meaningful. I don't believe Lucene actually uses the
> document vector -- it uses the orthogonal projection of the document
> vector into the hyperspace of query terms, since it only considers
> document vector terms corresponding to query vector terms. The distance
> from the tip of the projected document vector to the hyperplane
> orthogonal to the query vector (within the query hyperspace) does not
> seem that meaningful, even if the units were clear and natural.
> Document vectors at different angles and arbitrarily large distances
> from the query vector can have the same distance to this plane.
>
> From a practical standpoint, I still think it is important to have
> meaningful normalized final scores so that applications can interpret
> these scores, for example to present results to users in a manner that
> depends on the relevance of the individual results. This seems easy to
> do in a natural way along the lines of my last proposal (boost-weighted
> normalization, possibly including some other factors). I've been busy
> on other aspects of my project, but still hope to get back to this and
> contribute a proposed improved scoring scheme.
>
> Chuck
>
> > -----Original Message-----
> > From: Christoph Goller [mailto:[EMAIL PROTECTED]
> > Sent: Sunday, October 31, 2004 8:00 AM
> > To: Lucene Developers List
> > Subject: About Hit Scoring
> >
> > I looked at the scoring mechanism more closely again. Some of you may
> > remember that there was a discussion about this recently. There was
> > especially some argument about the theoretical justification of
> > the current scoring algorithm.
> > Chuck proposed that, at least from
> > a theoretical perspective, it would be good to apply a normalization
> > to the document vector and thus implement the cosine similarity.
> >
> > Well, we found out that this cannot be implemented efficiently.
> > However, I now found out that the current algorithm has a very
> > intuitive theoretical justification. Some of you may already know
> > that, but I never looked into it that deeply.
> >
> > Both the query and all documents are represented as vectors in term
> > vector space. The current scoring is simply the dot product of the
> > query with a document, normalized by the length of the query vector
> > (if we skip the additional coord factor). Geometrically speaking, this
> > is the distance of the document vector from the hyperplane through
> > the origin which is orthogonal to the query vector. See attached
> > figure.
> >
> > Christoph

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
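P.S. A small numeric sketch of the geometry Christoph describes above: dot(q, d) / |q| is the (signed) distance from the tip of the document vector to the hyperplane through the origin orthogonal to the query vector. The vectors and class name here are made up for illustration; this is not Lucene code, and it skips the coord factor just as the description does.

```java
// Illustrative sketch (not Lucene code) of score = dot(q, d) / |q|,
// i.e. the distance of the document vector from the hyperplane through
// the origin that is orthogonal to the query vector.
public class HyperplaneDistanceSketch {

    public static double score(double[] q, double[] d) {
        double dot = 0.0;
        double qNormSq = 0.0;
        for (int i = 0; i < q.length; i++) {
            dot += q[i] * d[i];
            qNormSq += q[i] * q[i];
        }
        // Dividing the dot product by |q| gives the distance to the
        // q-orthogonal hyperplane through the origin.
        return dot / Math.sqrt(qNormSq);
    }

    public static void main(String[] args) {
        double[] q  = { 3, 4 };   // |q| = 5
        double[] d1 = { 3, 4 };   // parallel to q, length 5 -> score 5.0
        double[] d2 = { 6, 8 };   // same direction, twice as long -> score 10.0
        double[] d3 = { 10, 5 };  // different angle, yet the same score as d2
        System.out.println(score(q, d1));
        System.out.println(score(q, d2));
        System.out.println(score(q, d3));
    }
}
```

The d2/d3 pair illustrates Chuck's objection: document vectors at quite different angles from the query can sit at the same distance from this plane, so the distance alone says nothing about the angle (i.e. the cosine similarity).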