That's an interesting point that helps to better analyze the situation.
It seems to me the units are arbitrary and so the distance in this case
is not very meaningful.  I don't believe Lucene actually uses the full
document vector -- it uses the orthogonal projection of the document
vector into the subspace spanned by the query terms, since it only
considers the document-vector components corresponding to query terms.
The distance from the tip of the projected document vector to the
hyperplane orthogonal to the query vector (within that query subspace)
does not seem that meaningful, even if the units were clear and
natural.  Document vectors at very different angles to the query
vector, and arbitrarily far from it, can all have the same distance to
this plane.
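
To make this concrete, here is a tiny self-contained sketch (plain
Java, nothing Lucene-specific, with invented term weights): scoring by
the dot product divided by the query length gives three documents of
very different lengths and angles relative to the query exactly the
same score, because only their component along the query direction
survives.

// Toy illustration in a 2-term query space: score(q, d) = (q . d) / |q|.
// The term weights below are invented purely for illustration.
public class SameScoreDemo {

    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    static double norm(double[] a) {
        return Math.sqrt(dot(a, a));
    }

    // Dot product normalized by the query length only
    // (the interpretation discussed in this thread, coord factor ignored).
    static double score(double[] q, double[] d) {
        return dot(q, d) / norm(q);
    }

    public static void main(String[] args) {
        double[] q  = {1.0, 1.0};        // query vector
        double[] d1 = {1.0, 1.0};        // short doc, same direction as q
        double[] d2 = {2.0, 0.0};        // longer doc, 45 degrees off q
        double[] d3 = {101.0, -99.0};    // huge doc, nearly orthogonal to q

        // All three print the same value (sqrt(2) ~ 1.414), although the
        // lengths of d1, d2, d3 and their angles to q differ wildly.
        System.out.println(score(q, d1));
        System.out.println(score(q, d2));
        System.out.println(score(q, d3));
    }
}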

From a practical standpoint, I still think it is important to have
meaningful normalized final scores so that applications can interpret
these scores, for example to present results to users in a manner that
depends on the relevance of the individual results.  This seems easy to
do in a natural way along the lines of my last proposal (boost-weighted
normalization, possibly including some other factors).  I've been busy
on other aspects of my project, but still hope to get back to this and
contribute a proposed improved scoring scheme.
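
To make the idea concrete, a rough, purely illustrative sketch follows
(this is not the exact scheme from my earlier proposal; the use of the
plain sum of clause boosts as the "ideal" score is just an assumption
for the example): divide each hit's raw score by the best score an
ideal document matching every clause could get, so that final scores
land in [0, 1] and read roughly as "fraction of an ideal match".

// Hypothetical normalization sketch -- not Lucene code, and not
// necessarily the boost-weighted scheme proposed earlier in this thread.
public class NormalizedScoreSketch {

    /** Maps a raw score into [0, 1] relative to an assumed ideal score. */
    static float normalize(float rawScore, float idealScore) {
        if (idealScore <= 0.0f) return 0.0f;   // degenerate query
        float s = rawScore / idealScore;
        return Math.min(s, 1.0f);              // guard against rounding
    }

    /** Toy stand-in: take the sum of clause boosts as the ideal score. */
    static float idealScoreFromBoosts(float[] clauseBoosts) {
        float sum = 0.0f;
        for (float b : clauseBoosts) sum += b;
        return sum;
    }

    public static void main(String[] args) {
        float[] boosts = {2.0f, 1.0f, 1.0f};         // invented clause boosts
        float ideal = idealScoreFromBoosts(boosts);  // 4.0
        System.out.println(normalize(3.0f, ideal));  // 0.75 -> 75% of ideal
        System.out.println(normalize(5.0f, ideal));  // clipped to 1.0
    }
}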

Chuck

  > -----Original Message-----
  > From: Christoph Goller [mailto:[EMAIL PROTECTED]
  > Sent: Sunday, October 31, 2004 8:00 AM
  > To: Lucene Developers List
  > Subject: About Hit Scoring
  > 
  > I looked at the scoring mechanism more closely again. Some of you may
  > remember that there was a discussion about this recently. There was
  > especially some argument about the theoretical justification of
  > the current scoring algorithm. Chuck proposed that at least from
  > a theoretical perspective it would be good to apply a normalization
  > on the document vector and thus implement the cosine similarity.
  > 
  > Well, we found out that this cannot be implemented efficiently.
  > However, I now found out that the current algorithm has a very
  > intuitive theoretical justification. Some of you may already know
  > that, but I never looked into it that deeply.
  > 
  > Both the query and all documents are represented as vectors in term
  > vector space. The current score is simply the dot product of the
  > query vector with the document vector, normalized by the length of
  > the query vector (if we skip the additional coord factor).
  > Geometrically speaking, this is the distance of the tip of the
  > document vector from the hyperplane through the origin that is
  > orthogonal to the query vector. See attached figure.
  > 
  > Christoph
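
To spell out the geometric reading described above (again a small
self-contained sketch with made-up vectors, not Lucene code): dropping
a perpendicular from the tip of the document vector onto the
hyperplane { x : q . x = 0 } gives a foot point on that plane, and the
length of that perpendicular comes out equal to (q . d) / |q|, i.e.
the score Christoph describes.

// Numeric check of the geometric interpretation: (q . d) / |q| equals the
// distance from the tip of d to the hyperplane through the origin that is
// orthogonal to q. The vectors are arbitrary examples, not real term weights.
public class HyperplaneDistanceCheck {

    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    public static void main(String[] args) {
        double[] q = {3.0, 1.0, 2.0};   // query vector
        double[] d = {1.0, 4.0, 2.0};   // document vector

        double qLen = Math.sqrt(dot(q, q));
        double score = dot(q, d) / qLen;   // the score discussed above

        // Foot of the perpendicular from d onto { x : q . x = 0 }:
        // subtract d's component along the q direction.
        double t = dot(q, d) / dot(q, q);
        double[] foot = new double[d.length];
        for (int i = 0; i < d.length; i++) foot[i] = d[i] - t * q[i];

        // foot lies on the hyperplane (q . foot == 0 up to rounding), and
        // the length of (d - foot) reproduces the score.
        double distSq = 0;
        for (int i = 0; i < d.length; i++) {
            double diff = d[i] - foot[i];
            distSq += diff * diff;
        }
        System.out.println("q . foot   = " + dot(q, foot));       // ~0
        System.out.println("score      = " + score);              // 11/sqrt(14)
        System.out.println("plane dist = " + Math.sqrt(distSq));  // same value
    }
}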
