Doug,

That's a good point about how the standard vector-space inner-product similarity measure implies that idf is squared relative to the document tf. Even having been aware of this formula for a long time, this particular implication never occurred to me. Do you know whether anybody has done precision/recall or other empirical relevance measurements comparing this with a model that does not square idf?
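To spell out the implication: with tf*idf weights on both sides of the dot product, each term's contribution is

  weight(t,q) * weight(t,d) = ( tf(t,q) * idf(t) ) * ( tf(t,d) * idf(t) )
                            = tf(t,q) * tf(t,d) * idf(t)^2

so idf enters each summand quadratically while each tf enters only linearly.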
Regarding normalization, the normalization in Hits does not have very nice properties. Because of the > 1.0 threshold check, it loses information, and it arbitrarily defines the highest-scoring result in any list that generates scores above 1.0 as a perfect match. It would be nice if score values were meaningful independent of searches, e.g., if "0.6" meant the same quality of retrieval no matter what search was done. This would allow, for example, sites to use a simple quality threshold to show only results that were "good enough". At my last company (I was President and head of engineering for InQuira), we found this to be important to many customers.

The standard vector-space similarity measure includes normalization by the product of the norms of the vectors, i.e.:

  score(d,q) = sum over t of ( weight(t,q) * weight(t,d) )
               / sqrt( (sum over t of weight(t,q)^2) * (sum over t of weight(t,d)^2) )

This makes the score a cosine, which, since the values are all positive, forces it into [0, 1]. The sumOfSquares() normalization in Lucene does not fully implement this. Is there a specific reason for that?

Re. explain(), I don't see a downside to extending it to show the final normalization in Hits. It could still show the raw score just prior to that normalization. That said, I think it would be best to have a normalization that renders scores comparable across searches.

Chuck

> -----Original Message-----
> From: Doug Cutting [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, October 13, 2004 9:38 AM
> To: Lucene Developers List
> Subject: Re: Search and Scoring
>
> Chuck Williams wrote:
> > I think there are at least two bugs here:
> > 1. idf should not be squared.
>
> I discussed this in a separate message. It's not a bug.
>
> > 2. explain() should explain the actual reported score().
>
> This is mostly a documentation bug in Hits. The normalization of
> scores to 1.0 is performed only by Hits. Hits is a high-level wrapper
> on the lower-level HitCollector-based search implementations, which do
> not perform this normalization. We should probably document that Hits
> scores are so normalized. Also, we could add a method to disable this
> normalization in Hits. The normalization was added long ago because
> many folks found it disconcerting when scores were greater than 1.0.
>
> We should not attempt to normalize scores reported by explain(). The
> intended use of explain() is to compare its output against other calls
> to explain(), in order to understand how one document scores higher
> than another. Scores don't make much sense in isolation, and neither
> do explanations.
>
> Doug
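For readers following the thread, here is a minimal, self-contained Java sketch contrasting the Hits-style normalization Doug describes with the cosine normalization Chuck proposes. This is not Lucene's actual code: the class and method names and the dense-array score representation are illustrative assumptions only.

    // Minimal sketch contrasting the two normalizations discussed above.
    public class NormalizationSketch {

      // Hits-style: if the top score exceeds 1.0, scale every score by
      // 1/top. The best hit then always reads exactly 1.0, so information
      // about how good the best match really was is lost, and scores are
      // not comparable across searches. Assumes scores sorted best-first.
      static float[] hitsStyleNormalize(float[] rawScores) {
        float[] out = (float[]) rawScores.clone();
        if (out.length > 0 && out[0] > 1.0f) {   // the > 1.0 threshold check
          float norm = 1.0f / out[0];            // top hit becomes exactly 1.0
          for (int i = 0; i < out.length; i++)
            out[i] *= norm;
        }
        return out;                              // scores <= 1.0 pass through unchanged
      }

      // Cosine normalization: divide the dot product by the product of the
      // vector norms. With all-positive weights the result lies in [0, 1]
      // and means the same thing regardless of which query produced it.
      static float cosineScore(float[] queryWeights, float[] docWeights) {
        float dot = 0f, qNormSq = 0f, dNormSq = 0f;
        for (int i = 0; i < queryWeights.length; i++) {
          dot     += queryWeights[i] * docWeights[i];
          qNormSq += queryWeights[i] * queryWeights[i];
          dNormSq += docWeights[i]   * docWeights[i];
        }
        return (float) (dot / Math.sqrt(qNormSq * dNormSq));
      }
    }

A score from cosineScore() could be compared against a fixed "good enough" threshold across different searches, which is exactly what the Hits-style scaling cannot support.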
> -----Original Message-----
> From: Doug Cutting [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, October 13, 2004 9:25 AM
> To: Lucene Developers List
> Subject: Re: Contribution: better multi-field searching
>
> Paul Elschot wrote:
> >> Did you see my IDF question at the bottom of the original note? I'm
> >> really curious why the square of IDF is used for Term and Phrase
> >> queries, rather than just IDF. It seems like it might be a bug?
> >
> > I missed that.
> > It has been discussed recently, but I don't remember the outcome,
> > perhaps someone else?
>
> This has indeed been discussed before.
>
> Lucene computes a dot-product of a query vector and each document
> vector. Weights in both vectors are normalized tf*idf, i.e.,
> (tf*idf)/length. The dot product of vectors d and q is:
>
>   score(d,q) = sum over t of ( weight(t,q) * weight(t,d) )
>
> Given this formulation, and the use of tf*idf weights, each component
> of the sum has an idf^2 factor. That's just the way it works with dot
> products of tf*idf/length vectors. It's not a bug. If folks don't like
> it, they can simply override Similarity.idf() to return sqrt(super()).
>
> If someone can demonstrate that an alternate formulation produces
> superior results for most applications, then we should of course
> change the default implementation. But just noting that there's a
> factor equal to idf^2 in each element of the sum does not do this.
>
> Doug
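Doug's suggested workaround amounts to a one-method subclass. A sketch, assuming Lucene's DefaultSimilarity and its idf(int docFreq, int numDocs) hook (check the method signature against your Lucene version):

    import org.apache.lucene.search.DefaultSimilarity;

    // Sketch of "override Similarity.idf() to return sqrt(super())":
    // in the model Doug describes, idf() feeds the weight of a term on
    // both the query side and the document side of the dot product, so
    // returning the square root makes each term's score contribution
    // linear in idf rather than quadratic.
    public class SqrtIdfSimilarity extends DefaultSimilarity {
      public float idf(int docFreq, int numDocs) {
        return (float) Math.sqrt(super.idf(docFreq, numDocs));
      }
    }

It would be installed with something like searcher.setSimilarity(new SqrtIdfSimilarity()) before searching, and on the IndexWriter as well if index-time weighting should match.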