That's a good point about how the standard vector space inner product similarity measure implies that idf is squared relative to the document tf. Even though I have been aware of this formula for a long time, this particular implication never occurred to me. Do you know if anybody has done precision/recall or other empirical relevance measurements comparing this against a model that does not square idf?
No, not that I know of.
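If someone wanted to measure it, here is a rough, untested sketch of one way to run the experiment: since idf() contributes to both the query-side and the document-side weight, a Similarity whose idf() returns the square root of the standard value leaves the product with idf to the first power only. The class name is made up for illustration, and it assumes the DefaultSimilarity idf(docFreq, numDocs) signature:

    import org.apache.lucene.search.DefaultSimilarity;

    // Untested sketch: return sqrt(idf) so that the query-side and
    // document-side factors multiply out to idf^1 rather than idf^2.
    public class SingleIdfSimilarity extends DefaultSimilarity {
      public float idf(int docFreq, int numDocs) {
        return (float) Math.sqrt(super.idf(docFreq, numDocs));
      }
    }

Setting it on the searcher (searcher.setSimilarity(new SingleIdfSimilarity())) and re-running the same queries against the same index would then let a precision/recall comparison isolate the effect of squaring idf.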
Regarding normalization, the normalization in Hits does not have very nice properties. Because of the > 1.0 threshold check, it loses information, and it arbitrarily defines the highest-scoring result in any list that produces scores above 1.0 as a perfect match. It would be nice if score values were meaningful independent of searches, e.g., if "0.6" meant the same quality of retrieval regardless of what search was done. This would allow, for example, sites to use a simple quality threshold to show only results that were "good enough". At my last company (I was President and head of engineering for InQuira), we found this to be important to many customers.
If this is a big issue for you, as it seems it is, please submit a patch to optionally disable score normalization in Hits.java.
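In the meantime, raw, un-normalized scores are already available through the HitCollector API, so an absolute cutoff can be applied without patching Hits (though raw scores are still not comparable across different queries). A rough, untested sketch, where 'searcher', 'query', and the 0.6 cutoff are placeholders:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.search.HitCollector;

    // Assumes an existing IndexSearcher 'searcher' and Query 'query'.
    final float threshold = 0.6f;          // illustrative cutoff only
    final List passing = new ArrayList();  // ids of docs that clear the cutoff
    searcher.search(query, new HitCollector() {
      public void collect(int doc, float score) {
        // 'score' here is the raw score, before Hits' normalization
        if (score >= threshold) {
          passing.add(new Integer(doc));
        }
      }
    });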
The standard vector space similarity measure includes normalization by the product of the norms of the vectors, i.e.:
score(d,q) = sum(t) ( weight(t,q) * weight(t,d) ) / sqrt( (sum(t) weight(t,q)^2) * (sum(t) weight(t,d)^2) )
This makes the score a cosine, which, since the values are all positive, forces it into [0, 1]. The sumOfSquares() normalization in Lucene does not fully implement this. Is there a specific reason for that?
The quantity 'sum(t) weight(t,d)^2' must be recomputed for every document each time a document is added to the collection, since 'weight(t,d)' depends on global term statistics (idf), which change as the collection changes. This is prohibitively expensive. Research has also demonstrated that such cosine normalization gives somewhat inferior results (e.g., Singhal's work on pivoted length normalization).
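To make that dependency concrete, here is an illustration (plain Java over term-to-tf maps, not Lucene code) of full cosine scoring with the usual log-based idf; the document norm in the denominator is built from idf, so any added document that changes a docFreq or numDocs invalidates every previously computed document norm involving those statistics:

    import java.util.HashSet;
    import java.util.Iterator;
    import java.util.Map;
    import java.util.Set;

    // Illustration only.  'queryTf' and 'docTf' map term (String) -> raw tf
    // (Integer); 'docFreq' maps every term -> document frequency (Integer);
    // 'numDocs' is the collection size.  weight(t,x) = tf(t,x) * idf(t).
    static float cosine(Map queryTf, Map docTf, Map docFreq, int numDocs) {
      double dot = 0.0, qNorm = 0.0, dNorm = 0.0;
      Set terms = new HashSet(queryTf.keySet());
      terms.addAll(docTf.keySet());
      for (Iterator i = terms.iterator(); i.hasNext();) {
        String t = (String) i.next();
        int df = ((Integer) docFreq.get(t)).intValue();
        double idf = Math.log((double) numDocs / (df + 1)) + 1.0;
        double wq = queryTf.containsKey(t)
            ? ((Integer) queryTf.get(t)).intValue() * idf : 0.0;
        double wd = docTf.containsKey(t)
            ? ((Integer) docTf.get(t)).intValue() * idf : 0.0;
        dot += wq * wd;      // numerator:   sum(t) weight(t,q) * weight(t,d)
        qNorm += wq * wq;    // denominator: sum(t) weight(t,q)^2
        dNorm += wd * wd;    // denominator: sum(t) weight(t,d)^2
      }
      return (float) (dot / Math.sqrt(qNorm * dNorm));
    }

Lucene instead stores a per-document length norm at index time, which does not depend on idf and so does not have to be recomputed as the collection grows.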
Re. explain(), I don't see a downside to extending it to show the final normalization in Hits. It could still show the raw score just prior to that normalization.
In order to normalize scores to 1.0 one must know the maximum score. Explain only computes the score for a single document, and the maximum score is not known.
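If someone really needs an explanation value on the same scale Hits reports, a rough workaround (untested; 'searcher', 'query', and 'docId' are placeholders) is to fetch the top raw score with a separate search and apply the same > 1.0 rule that Hits does:

    import org.apache.lucene.search.Explanation;
    import org.apache.lucene.search.TopDocs;

    // Assumes an existing IndexSearcher 'searcher' and Query 'query'.
    TopDocs top = searcher.search(query, null, 1);      // top raw score
    float maxScore = top.scoreDocs.length > 0 ? top.scoreDocs[0].score : 1.0f;
    float scoreNorm = maxScore > 1.0f ? 1.0f / maxScore : 1.0f;

    Explanation exp = searcher.explain(query, docId);   // raw explanation
    float hitsStyleScore = exp.getValue() * scoreNorm;  // Hits-style value

This costs an extra search, and the resulting values are still only comparable within a single query.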
> Although I think it would be best to have a normalization that would render scores comparable across searches.
Then please submit a patch. Lucene doesn't change on its own.
Doug