RE: About Hit Scoring

Chuck Williams Sun, 31 Oct 2004 09:07:39 -0800

Good point that the document-vector directions outside the query
hyperspace are irrelevant to the hyperplane distance.  These other
coordinates do affect the angle to the query vector, but not the
distance to the query-orthogonal hyperplane.
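
A minimal sketch of that geometric point, using plain arrays as
stand-ins for term-weight vectors.  Nothing here is Lucene API; the
class name, method names and sample weights are made up for
illustration.  Two documents that agree on the query-term coordinates
get the same distance to the hyperplane orthogonal to the query, while
an extra non-query coordinate still changes the cosine of the angle.

```java
// Sketch only: plain arrays stand in for term-weight vectors; not Lucene code.
final class HyperplaneDemo {

    // Dot product of two vectors of equal length.
    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Euclidean length of a vector.
    static double norm(double[] a) {
        return Math.sqrt(dot(a, a));
    }

    // Signed distance from the tip of d to the hyperplane through the
    // origin that is orthogonal to q (non-negative for non-negative weights).
    static double distanceToHyperplane(double[] d, double[] q) {
        return dot(d, q) / norm(q);
    }

    // Cosine of the angle between d and q.
    static double cosine(double[] d, double[] q) {
        return dot(d, q) / (norm(d) * norm(q));
    }

    public static void main(String[] args) {
        // Dimensions 0 and 1 are query terms; dimension 2 is a non-query term.
        double[] q  = {1.0, 1.0, 0.0};
        double[] d1 = {2.0, 1.0, 0.0};
        double[] d2 = {2.0, 1.0, 5.0}; // same as d1 on the query terms

        System.out.println("distance d1: " + distanceToHyperplane(d1, q));
        System.out.println("distance d2: " + distanceToHyperplane(d2, q));
        System.out.println("cosine d1:   " + cosine(d1, q));
        System.out.println("cosine d2:   " + cosine(d2, q));
    }
}
```

The two distances come out equal because the non-query coordinate is
multiplied by a zero in q; the cosine for d2 is smaller because that
coordinate still contributes to the length of d2.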


My problem with the units actually arose from the tf's and especially
the idf's, so I don't see these terms as clarifying the interpretation
of the current scores.  In my app, the idf factors were dominating the
scale and magnitude of the final scores, and I could not figure out any
way to interpret what they meant in any absolute sense.  I have a custom
similarity that increases the base of the logarithm to tone them down
and takes a final square root to eliminate the squaring.  I continue to
believe that squaring has been shown in the field to be empirically and
theoretically unjustified.
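
A rough sketch of the two adjustments just described, written as
stand-alone Java rather than as an actual Similarity subclass.  It
assumes an idf of the form log(numDocs / (docFreq + 1)) + 1, assumes
that "the squaring" means the idf factor entering the score twice, and
picks base 10 arbitrarily.  The class and method names are hypothetical,
and where the square root is applied in the real custom similarity is
not stated in this message.

```java
// Sketch only: illustrates the idea, not the custom similarity itself.
final class DampedIdf {

    // idf with a configurable logarithm base (assumed form, see lead-in).
    // base = Math.E gives the natural-log variant; a larger base shrinks the value.
    static double idf(int docFreq, int numDocs, double base) {
        double ratio = numDocs / (double) (docFreq + 1);
        return Math.log(ratio) / Math.log(base) + 1.0;
    }

    // Final square root, undoing a factor that entered the score squared.
    static double unsquare(double rawScore) {
        return Math.sqrt(rawScore);
    }

    public static void main(String[] args) {
        int numDocs = 1_000_000;
        int docFreq = 9; // a rare term

        double natural = idf(docFreq, numDocs, Math.E);
        double damped = idf(docFreq, numDocs, 10.0);

        System.out.println("idf, natural log:       " + natural);
        System.out.println("idf, base 10:           " + damped);
        System.out.println("idf squared, base 10:   " + damped * damped);
        System.out.println("after final sqrt:       " + unsquare(damped * damped));
    }
}
```

With the larger base, a rare term's idf grows more slowly relative to
the constant 1, and the square root brings a squared idf factor back to
a single power.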

Chuck

  > -----Original Message-----
  > From: Christoph Goller [mailto:[EMAIL PROTECTED]
  > Sent: Sunday, October 31, 2004 8:55 AM
  > To: Lucene Developers List
  > Subject: Re: About Hit Scoring
  > 
  > Chuck Williams wrote:
  > > That's an interesting point that helps to better analyze the
  > > situation.  It seems to me the units are arbitrary and so the
  > > distance in this case is not very meaningful.  I don't believe
  > > Lucene actually uses the document vector -- it uses the orthogonal
  > > projection of the document vector into the hyperspace of query
  > > terms, since it only considers document vector terms corresponding
  > > to query vector terms.
  > 
  > For the distance of a document vector to the query-hyperplane, the
  > other directions of the document vector are irrelevant.
  > 
  > > The distance from the tip of the projected document vector to the
  > > hyperplane orthogonal to the query vector (within the query
  > > hyperspace) does not seem that meaningful, even if the units were
  > > clear and natural.  Document vectors at different angles and
  > > arbitrarily large distances from the query vector can have the
  > > same length to this plane.
  > 
  > The term frequency is normalized by the field length and furthermore
  > there is still idf that comes in. So the units do at least have some
  > meaning.
  > 
  > > From a practical standpoint, I still think it is important to have
  > > meaningful normalized final scores so that applications can
  > > interpret these scores, for example to present results to users in
  > > a manner that depends on the relevance of the individual results.
  > > This seems easy to do in a natural way along the lines of my last
  > > proposal (boost-weighted normalization, possibly including some
  > > other factors).
  > 
  > I still agree that it would be great to have scores that could be
  > compared between different queries.
  > 
  > Christoph
  > 
  > ---------------------------------------------------------------------
  > To unsubscribe, e-mail: [EMAIL PROTECTED]
  > For additional commands, e-mail: [EMAIL PROTECTED]


