I took another close look at the scoring mechanism. Some of you may remember that it was discussed recently, in particular whether the current scoring algorithm has a sound theoretical justification. Chuck proposed that, at least from a theoretical perspective, it would be good to normalize the document vector and thus implement cosine similarity.
Well, we found that this cannot be implemented efficiently. However, I have now found that the current algorithm has a very intuitive theoretical justification. Some of you may already know this, but I had never looked into it that deeply.
Both the query and all documents are represented as vectors in term vector space. The current score is simply the dot product of the query vector with a document vector, normalized by the length of the query vector (if we skip the additional coord factor). Geometrically speaking, this is the signed distance of the document vector from the hyperplane through the origin that is orthogonal to the query vector. See the attached figure.
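To make this concrete, here is a minimal sketch in Python (not Lucene's actual implementation; the coord factor is omitted, as above):

```python
import math

def score(query, doc):
    # Dot product of query and document vectors, normalized by the
    # Euclidean length of the query vector only (the document vector
    # is NOT normalized, so this is not cosine similarity).
    dot = sum(q * d for q, d in zip(query, doc))
    qnorm = math.sqrt(sum(q * q for q in query))
    return dot / qnorm

# The same value is the signed distance of the document vector d from
# the hyperplane {x : q . x = 0} through the origin orthogonal to the
# query vector q, which is (q . d) / |q|.
q = [1.0, 2.0]
d = [3.0, 1.0]
print(score(q, d))  # (1*3 + 2*1) / sqrt(5) = sqrt(5)
```

Since only the query norm appears in the denominator, it is a constant per query and does not change the ranking of documents; only the unnormalized document vector matters.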
Christoph