Good point on the irrelevance of the non-query-hyperspace document directions to the hyperplane distance. These other coordinates do affect the angle to the query vector, but not the distance to the query-orthogonal hyperplane.
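To make that concrete, here is a small numeric sketch. The four-term space, the vectors, and the helper names are all made up for illustration and have nothing to do with Lucene's actual code. Two documents share the same weights on the query terms, but one also carries weight on non-query terms.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Euclidean length of a vector."""
    return math.sqrt(dot(a, a))

def hyperplane_distance(doc, query):
    """Distance from doc to the hyperplane through the origin orthogonal to query."""
    return abs(dot(doc, query)) / norm(query)

def cosine(doc, query):
    """Cosine of the angle between doc and query."""
    return dot(doc, query) / (norm(doc) * norm(query))

# Hypothetical term space [t1, t2, t3, t4]; the query uses only t1 and t2.
query = [1.0, 1.0, 0.0, 0.0]
doc_a = [3.0, 1.0, 0.0, 0.0]  # weight on query terms only
doc_b = [3.0, 1.0, 5.0, 2.0]  # same query-term weights plus non-query weight

dist_a = hyperplane_distance(doc_a, query)
dist_b = hyperplane_distance(doc_b, query)
cos_a = cosine(doc_a, query)
cos_b = cosine(doc_b, query)

print(dist_a, dist_b)  # identical: t3 and t4 contribute nothing to doc . query
print(cos_a, cos_b)    # differ: the extra coordinates lengthen doc_b
```

The extra coordinates drop out of the dot product because the query is zero there, so the distance is unchanged, while the longer document vector lowers the cosine.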
My problem with the units actually arose from the tf's and especially the idf's, so I don't see these terms as clarifying the interpretation of the current scores. In my app, the idf factors were dominating the scale and magnitude of the final scores and I could not figure out any way to interpret what they meant in any absolute sense. I have a custom similarity that has increased the base of the logarithm to tone them down, and takes a final square root to eliminate the squaring, which I continue to believe has been shown in the field to be empirically and theoretically unjustified.

Chuck

> -----Original Message-----
> From: Christoph Goller [mailto:[EMAIL PROTECTED]
> Sent: Sunday, October 31, 2004 8:55 AM
> To: Lucene Developers List
> Subject: Re: About Hit Scoring
>
> Chuck Williams wrote:
> > That's an interesting point that helps to better analyze the
> > situation. It seems to me the units are arbitrary and so the
> > distance in this case is not very meaningful. I don't believe
> > Lucene actually uses the document vector -- it uses the orthogonal
> > projection of the document vector into the hyperspace of query
> > terms, since it only considers document vector terms corresponding
> > to query vector terms.
>
> For the distance of a document vector to the query-hyperplane, the
> other directions of the document vector are irrelevant.
>
> > The distance from the tip of the projected document vector to the
> > hyperplane orthogonal to the query vector (within the query
> > hyperspace) does not seem that meaningful, even if the units were
> > clear and natural. Document vectors at different angles and
> > arbitrarily large distances from the query vector can have the
> > same length to this plane.
>
> The term frequency is normalized by the field length and furthermore
> there is still idf that comes in. So the units do at least have some
> meaning.
>
> > From a practical standpoint, I still think it is important to have
> > meaningful normalized final scores so that applications can
> > interpret these scores, for example to present results to users in
> > a manner that depends on the relevance of the individual results.
> > This seems easy to do in a natural way along the lines of my last
> > proposal (boost-weighted normalization, possibly including some
> > other factors).
>
> I still agree that it would be great to have scores that could be
> compared between different queries.
>
> Christoph
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
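
P.S. A rough sketch of the kind of damping described at the top of this message, not the actual custom similarity. It assumes the classic idf form 1 + ln(numDocs / (docFreq + 1)) and assumes that "the squaring" means idf entering each term's contribution squared. The base of 10, the function names, and the sample numbers are illustrative assumptions, and coord, queryNorm, boosts and field norms are left out entirely.

```python
import math

def idf_classic(num_docs, doc_freq):
    """Natural-log idf: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def idf_damped(num_docs, doc_freq, base=10.0):
    """Same form with a larger log base (base 10 is an assumed value)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1), base)

def raw_score(tfs, idfs):
    """Simplified sum of tf * idf^2 over the matching terms."""
    return sum(tf * idf * idf for tf, idf in zip(tfs, idfs))

def damped_score(tfs, idfs):
    """Final square root that undoes the squaring of the idf factor."""
    return math.sqrt(raw_score(tfs, idfs))

# Hypothetical index: one million documents, one rare and one common term.
num_docs = 1_000_000
doc_freqs = [9, 99_999]
tfs = [1.0, 2.0]

classic = [idf_classic(num_docs, df) for df in doc_freqs]
damped = [idf_damped(num_docs, df) for df in doc_freqs]

print(classic, raw_score(tfs, classic))
print(damped, damped_score(tfs, damped))
```

With these numbers the rare term's idf falls from about 12.5 to about 6, and the square root brings the final score back to the same order of magnitude as a single tf * idf product. That narrows the range, but it still does not give the score an absolute meaning.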