Chuck Williams wrote:
> That's an interesting point that helps to better analyze the situation.
> It seems to me the units are arbitrary and so the distance in this case
> is not very meaningful. I don't believe Lucene actually uses the
> document vector -- it uses the orthogonal projection of the document
> vector into the hyperspace of query terms, since it only considers
> document vector terms corresponding to query vector terms.

For the distance of a document vector to the hyperplane orthogonal to
the query vector, the components of the document vector along non-query
terms are irrelevant. The projection therefore changes nothing.
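To make this concrete, here is a minimal sketch with plain weight vectors. It is not Lucene code; the function names and the example weights are made up for illustration. It zeroes the document components that have no matching query term and checks that the distance to the hyperplane orthogonal to the query vector stays the same.

```python
import math


def dot(u, v):
    """Inner product of two equal-length weight vectors."""
    return sum(a * b for a, b in zip(u, v))


def project_onto_query_terms(doc, query):
    """Keep only the document components whose query weight is non-zero."""
    return [d if q != 0 else 0.0 for d, q in zip(doc, query)]


def distance_to_query_hyperplane(doc, query):
    """Signed distance from doc to the hyperplane through the origin
    whose normal is the query vector."""
    return dot(doc, query) / math.sqrt(dot(query, query))


# Hypothetical weights over four terms; the query uses only the first two.
query = [1.0, 2.0, 0.0, 0.0]
doc = [3.0, 1.0, 5.0, 7.0]
projected = project_onto_query_terms(doc, query)

print(distance_to_query_hyperplane(doc, query))
print(distance_to_query_hyperplane(projected, query))
```

Both calls print the same value, because the dropped components are multiplied by zero query weights in the dot product.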

> The distance from the tip of the projected document vector to the
> hyperplane orthogonal to the query vector (within the query hyperspace)
> does not seem that meaningful, even if the units were clear and natural.
> Document vectors at different angles and arbitrarily large distances
> from the query vector can have the same length to this plane.

The term frequency is normalized by the field length, and idf enters the
weights as well. So the units do have at least some meaning.
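The geometric objection quoted above is easy to reproduce. The following is only a sketch with invented two-term vectors, not anything taken from Lucene: three document vectors that all have the same distance to the hyperplane orthogonal to the query vector, although their angles to the query differ widely.

```python
import math


def dot(u, v):
    """Inner product of two equal-length weight vectors."""
    return sum(a * b for a, b in zip(u, v))


def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(dot(v, v))


def distance_to_hyperplane(doc, query):
    """Distance from doc to the hyperplane through the origin
    whose normal is the query vector."""
    return dot(doc, query) / norm(query)


def cosine(doc, query):
    """Cosine of the angle between doc and query."""
    return dot(doc, query) / (norm(doc) * norm(query))


# Hypothetical vectors, already restricted to the two query terms.
query = [1.0, 1.0]
docs = {
    "aligned": [2.0, 2.0],        # same direction as the query
    "tilted": [4.0, 0.0],         # 45 degrees away from the query
    "far_away": [104.0, -100.0],  # nearly perpendicular, far from the query
}

distances = {name: distance_to_hyperplane(d, query) for name, d in docs.items()}
cosines = {name: cosine(d, query) for name, d in docs.items()}

for name in docs:
    print(name, distances[name], cosines[name])
```

All three distances are identical because each vector has the same dot product with the query, while the cosine drops from about 1 for the aligned vector to nearly 0 for the distant one.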

> From a practical standpoint, I still think it is important to have
> meaningful normalized final scores so that applications can interpret
> these scores, for example to present results to users in a manner that
> depends on the relevance of the individual results.  This seems easy to
> do in a natural way along the lines of my last proposal (boost-weighted
> normalization, possibly including some other factors).

I still agree that it would be great to have scores that could be compared
between different queries.

Christoph
