Addendum:  I forgot probably the most important point.  The current
normalization in Hits changes the final score so that it is not the
distance to the query-orthogonal hyperplane.  This normalization makes
the final score ambiguous and harder to interpret.  It's ambiguous since
the normalization may or may not be applied (depending on the fairly
arbitrary condition of whether or not the top raw score is greater than
1.0).  When the normalization is applied, a result's
final score is the ratio of its distance from the query-orthogonal
hyperplane to the largest distance of any result, which doesn't seem
particularly meaningful to me.  At least there is no absolute
interpretation for this score, in the sense that a specific number
indicates a specific relevance, which is what I'm looking for.
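
To make the ambiguity concrete, here is a minimal sketch in Python
(chosen for brevity; this is not Lucene's actual code).  The names
raw_score and normalize_hits are hypothetical, vectors are assumed to be
plain term-to-weight dicts, and the coord factor is ignored.

```python
import math


def raw_score(query, doc):
    """Dot product of query and doc, divided by the query vector's length.

    Only terms present in the query contribute, so this is the distance of
    the projected document vector from the hyperplane through the origin
    that is orthogonal to the query vector. Assumes a non-empty query.
    """
    dot = sum(weight * doc.get(term, 0.0) for term, weight in query.items())
    query_length = math.sqrt(sum(weight * weight for weight in query.values()))
    return dot / query_length


def normalize_hits(raw_scores):
    """Sketch of the conditional normalization described above (an assumption
    about its shape, not the real Hits code): scale by the top raw score,
    but only when that top score is greater than 1.0.
    """
    top = max(raw_scores, default=0.0)
    if top > 1.0:
        return [score / top for score in raw_scores]
    return list(raw_scores)


query = {"a": 3.0, "b": 4.0}        # length 5.0
strong_doc = {"a": 3.0, "b": 4.0}   # raw score 25 / 5
weak_doc = {"a": 1.0, "b": 0.5}     # raw score 5 / 5

with_strong_hit = normalize_hits(
    [raw_score(query, strong_doc), raw_score(query, weak_doc)]
)
weak_hit_alone = normalize_hits([raw_score(query, weak_doc)])
```

In this sketch the weak document has the same raw score in both result
sets, yet its final score differs depending on whether a stronger hit
happens to be present, so a given final score does not indicate a
specific relevance.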

Chuck

  > -----Original Message-----
  > From: Chuck Williams [mailto:[EMAIL PROTECTED]
  > Sent: Sunday, October 31, 2004 8:13 AM
  > To: Lucene Developers List
  > Subject: RE: About Hit Scoring
  > 
  > That's an interesting point that helps to better analyze the situation.
  > It seems to me the units are arbitrary and so the distance in this case
  > is not very meaningful.  I don't believe Lucene actually uses the
  > document vector -- it uses the orthogonal projection of the document
  > vector into the hyperspace of query terms, since it only considers
  > document vector terms corresponding to query vector terms.  The distance
  > from the tip of the projected document vector to the hyperplane
  > orthogonal to the query vector (within the query hyperspace) does not
  > seem that meaningful, even if the units were clear and natural.
  > Document vectors at different angles and arbitrarily large distances
  > from the query vector can have the same distance to this plane.
  > 
  > From a practical standpoint, I still think it is important to have
  > meaningful normalized final scores so that applications can interpret
  > these scores, for example to present results to users in a manner that
  > depends on the relevance of the individual results.  This seems easy to
  > do in a natural way along the lines of my last proposal (boost-weighted
  > normalization, possibly including some other factors).  I've been busy
  > on other aspects of my project, but still hope to get back to this and
  > contribute a proposed improved scoring scheme.
  > 
  > Chuck
  > 
  >   > -----Original Message-----
  >   > From: Christoph Goller [mailto:[EMAIL PROTECTED]
  >   > Sent: Sunday, October 31, 2004 8:00 AM
  >   > To: Lucene Developers List
  >   > Subject: About Hit Scoring
  >   >
  >   > I looked at the scoring mechanism more closely again. Some of you may
  >   > remember that there was a discussion about this recently. There was
  >   > especially some argument about the theoretical justification of
  >   > the current scoring algorithm. Chuck proposed that at least from
  >   > a theoretical perspective it would be good to apply a normalization
  >   > on the document vector and thus implement the cosine similarity.
  >   >
  >   > Well, we found out that this cannot be implemented efficiently.
  >   > However, I now found out that the current algorithm has a very
  >   > intuitive theoretical justification. Some of you may already know
  >   > that, but I never looked into it that deeply.
  >   >
  >   > Both the query and all documents are represented as vectors in term
  >   > vector space. The current scoring is simply the dot product of the
  >   > query with a document normalized by the length of the query vector
  >   > (if we skip the additional coord factor). Geometrically speaking this
  >   > is the distance of the document vector from the hyperplane through
  >   > the origin which is orthogonal to the query vector. See attached
  >   > figure.
  >   >
  >   > Christoph
  >   >
  >   >
  >   >
  > 
  > 
  > ---------------------------------------------------------------------
  > To unsubscribe, e-mail: [EMAIL PROTECTED]
  > For additional commands, e-mail: [EMAIL PROTECTED]

