Thanks Otis. Other than trying to get some consensus a) that this is a problem worth fixing, and b) on the best approach to fixing it, my central question is: if I fix it, is the fix likely to be incorporated back into Lucene? I don't want to deviate from the Lucene sources, especially across so many classes, so I would like to take this on only if there is a process to evaluate the changes and fold them back into Lucene if they provide the improvement I believe they will.

Thanks for any guidance on that,

Chuck

> -----Original Message-----
> From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
> Sent: Thursday, October 21, 2004 10:06 AM
> To: Lucene Developers List
> Subject: Re: Normalized Scoring -- was RE: idf and explain(), was Re: Search and Scoring

> Hi Chuck,

> The relative lack of responses is not because there is no interest, but probably because there are only a few people on lucene-dev who can follow/understand every detail of your proposal. I understand and hear you, but I have a hard time 'visualizing' some of the formulas in your proposal. What you are saying sounds right to me, but I don't have enough theoretical knowledge to go one way or the other.

> Otis

> --- Chuck Williams <[EMAIL PROTECTED]> wrote:

> > Hi everybody,

> > Although there doesn't seem to be much interest in this, I have one significant improvement to the proposal below and a couple of observations that clarify the situation.

> > To illustrate the problem that better normalization is intended to address: in my current application, for BooleanQuerys of two terms I frequently get a top score of 1.0 when only one of the terms is matched. 1.0 should indicate a "perfect match". I'd like to set my UI up to present the results differently depending on how good they are (e.g., showing a visual indication of result quality, dropping the really bad results entirely, and segregating the good results from other only vaguely relevant results). To build this kind of "intelligence" into the UI, I certainly need to know whether my top result matched all query terms, or only half of them. I'd like the score to tell me directly how good the matches are. The current normalization does not achieve this.

> > The intent of a new normalization scheme is to preserve the current scoring in the following sense: the ratio of the nth result's score to the best result's score remains the same. Therefore, the only question is what normalization factor to apply to all scores. This reduces to a very specific question that determines the entire normalization -- what should the score of the best matching result be?

> > The mechanism below has this property, i.e. it keeps the current score ratios, except that I removed one idf term for reasons covered earlier (this better reflects the current empirically best scoring algorithms). However, removing an idf is a completely separate issue. The improved normalization is independent of whether or not that change is made.

> > For the central question of what the top score should be, upon reflection, I don't like the definition below. It defined the top score as (approximately) the percentage of query terms matched by the top scoring result. A better conceptual definition is to use a weighted average based on the boosts, i.e., downward propagate all boosts to the underlying terms (or phrases). Specifically, the "net boost" of a term is the product of the direct boost of the term and all boosts applied to encompassing clauses. Then the score of the top result becomes the sum of the net boosts of its matching terms divided by the sum of the net boosts of all query terms.
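
> > Roughly, in code, the net-boost computation is just a walk of the query tree. This is a sketch only, not working Lucene source -- matches() and subQueries() below are hand-waved stand-ins for whatever the real implementation would use:

> >   // Sketch: push boosts down to the leaf terms.  The top score is then
> >   //   sumNetBoosts(query, 1.0f, true, topDoc) / sumNetBoosts(query, 1.0f, false, null)
> >   float sumNetBoosts(Query q, float inheritedBoost, boolean matchedOnly, Document doc) {
> >     float netBoost = inheritedBoost * q.getBoost();
> >     if (q instanceof TermQuery)
> >       // matches() stands for "this term occurred in the document being scored"
> >       return (!matchedOnly || matches((TermQuery) q, doc)) ? netBoost : 0.0f;
> >     float sum = 0.0f;
> >     for (Iterator it = subQueries(q).iterator(); it.hasNext(); )
> >       sum += sumNetBoosts((Query) it.next(), netBoost, matchedOnly, doc);
> >     return sum;
> >   }

> > (For a pure disjunction the sum would presumably become a max, but the boost propagation is the same.)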

> > This definition is a refinement of the original proposal below, and it could probably be further refined if some impact of the tf, idf and/or lengthNorm were desired in determining the top score. These other factors seem to be harder to normalize for, although I've thought of some simple approaches; e.g., assume the unmatched terms in the top result have values for these three factors equal to the average of the values of the matched terms, then apply exactly the same concept of actual score divided by theoretical maximum score. That would eliminate any need to maintain maximum-value statistics in the index.

> > As an example of the simple boost-based normalization, for the query

> >   ((a^2 b)^3 (c d^2))

> > the net boosts are:

> >   a --> 6
> >   b --> 3
> >   c --> 1
> >   d --> 2

> > So if a and b matched, but not c and d, in the top scoring result, its score would be (6 + 3) / (6 + 3 + 1 + 2) = 0.75. The normalizer would be 0.75 / (the top result's current score, computed without the current normalization); this normalizer would then be applied to all current scores (again without the current normalization) to produce the normalized scores.

> > For the simple query (a b), if only one of the terms matched in the top result, then its score would be 0.5, vs. 1.0 or many other possible scores today.

> > In addition to enabling more "intelligent" UIs that communicate the quality of results to end users, the proposal below also extends the explain() mechanism to fully explain the final normalized score. However, that change is also independent -- it could be done with the current scoring.

> > Am I the only one who would like to see better normalization in Lucene? Does anybody have a better approach?

> > If you've read this far, thanks for indulging me on this. I would love to see this or something with similar properties in Lucene. I really just want to build my app, but as stated below I would write and contribute this if there is interest in putting it in and nobody else wants to write it. Please let me know what you think one way or the other.

> > Thanks,

> > Chuck

> > > -----Original Message-----
> > > From: Chuck Williams
> > > Sent: Monday, October 18, 2004 7:04 PM
> > > To: 'Lucene Developers List'
> > > Subject: RE: idf and explain(), was Re: Search and Scoring

> > > Doug Cutting wrote:
> > > > If this is a big issue for you, as it seems it is, please submit a patch to optionally disable score normalization in Hits.java.
> > > and:
> > > > The quantity 'sum(t) weight(t,d)^2' must be recomputed for each document each time a document is added to the collection, since 'weight(t,d)' is dependent on global term statistics. This is prohibitively expensive. Research has also demonstrated that such cosine normalization gives somewhat inferior results (e.g., Singhal's pivoted length normalization).

> > > I'm willing to write, test and contribute code to address the normalization issue, i.e. to yield scores in [0, 1] that are meaningful across searches. Unfortunately, this is considerably more involved than just optionally eliminating the current normalization in Hits. Before undertaking this, I'd like to see if there is agreement in principle that this is a good idea, and that my specific proposal below is the right way to go.
> > > Also, I'd like to make sure I've correctly inferred the constraints on writing code to be incorporated into Lucene.

> > > After looking at this in more detail, I agree that the cosine normalization is not the way to go, for both efficiency and effectiveness reasons. A conceptual approach that would be efficient, relatively easy to implement, and that seems to have at least reasonable behavior would be to define the top scoring match to have a score roughly equal to the percentage of query terms it matches (its "netCoord"). Scores below the top hit would be reduced based on the ratio of their raw scores to the raw score of the top hit (considering all of the current score factors, except that I'd like to remove one of the idf factors, as discussed earlier).

> > > For a couple of simple cases:
> > >   a) the top match for a single-term query would always have a score of 1.0;
> > >   b) the top scoring match for a BooleanQuery using DefaultSimilarity with all non-prohibited TermQuery clauses would have a score of m/n, where the hit matches m of the n terms.

> > > This isn't optimal, but it seems much better than the current situation. Consider two single-term queries, s and t. If s matches more strongly than t in its top hit (e.g., a higher tf in a shorter field), it would be best if the top score of s were greater than the top score of t. But that kind of normalization would require keeping additional statistics that, so far as I know, are not currently in the index, like the maximum tf for terms and the minimum length for fields. These could be expensive to update on deletes. Also, normalizing by such factors could yield lower than subjectively reasonable scores in most cases, so it's not clear it would be better.

> > > The semantics above are at least clean, easy to understand, and support what seems to me to be the most important motivation for doing this: allowing an application to use simple thresholding to segregate likely-to-be-relevant hits from likely-to-be-irrelevant hits.

> > > More specifically, for a BooleanQuery of TermQuerys the resulting score functions would be:

> > >   BooleanQuery of TermQuerys:  sbq = (tq1 ... tqn)

> > >   baseScore(sbq, doc) = sum(tqi) boost(tqi) * idf(tqi.term) * tf(tqi.term, doc) * lengthNorm(tqi.term.field, doc)

> > >   rawScore(sbq, doc)  = coord(sbq, doc) * baseScore(sbq, doc)

> > >   norm(sbq, hits)     = 1 / max(hit in hits) baseScore(sbq, hit)

> > >   score(sbq, doc)     = rawScore(sbq, doc) * norm(sbq, hits)

> > > rawScores can be computed in the Scorer.score() methods and therefore used to sort the hits (e.g., in the collect() method of the HitCollector in IndexSearcher.search()). If the top scoring hit does not have the highest baseScore, then its score could be less than its coord; this seems desirable. These formulas imply that no result's score can be larger than its coord, so if coord is well defined (always between 0 and 1) then all results will be normalized between 0 and 1.
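
> > > As a sketch of where each quantity would live (illustration only, not working code: termQueries is just the array of the n clauses, matches(), idf(), tf() and lengthNorm() stand in for the corresponding Similarity/Scorer machinery, and maxBaseScore would be tracked by the HitCollector):

> > >   // per-document work, conceptually done in the BooleanQuery Scorer;
> > >   // unmatched terms contribute nothing since their tf is 0
> > >   float baseScore = 0.0f;
> > >   int matchedTerms = 0;
> > >   for (int i = 0; i < termQueries.length; i++) {
> > >     TermQuery tq = termQueries[i];
> > >     if (matches(tq, doc)) {
> > >       baseScore += tq.getBoost() * idf(tq) * tf(tq, doc) * lengthNorm(tq, doc);
> > >       matchedTerms++;
> > >     }
> > >   }
> > >   float coord = (float) matchedTerms / termQueries.length;  // m/n in the simple case
> > >   float rawScore = coord * baseScore;                       // what the hits are ranked by

> > >   // later, once the whole result set has been collected (e.g. in Hits):
> > >   //   score = rawScore / maxBaseScore;   // the best hit then scores its coord, i.e. m/n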

> > > In general the netCoord, which takes the place of coord in the simple case above, needs to be defined for any query. Conceptually, this should be the total percentage of query terms matched by the document. It must be recursively computable from the subquery structure and overridable for specific Query types (e.g., to support specialized coords, like one that is always 1.0, as is useful in multi-field searching). Suitable default definitions for TermQuery and BooleanQuery are:

> > >   TermQuery.netCoord = 1.0 if the term matches, 0.0 otherwise

> > >   BooleanQuery(c1 ... cn).netCoord = sum(ci) coord(1, n) * ci.netCoord

> > > This is not quite the percentage of terms matched; e.g., consider a BooleanQuery with two clauses, one of which is a BooleanQuery of 3 terms and the other of which is just a term. However, it doesn't seem unreasonable for a BooleanQuery to state that its clauses are equally important, and this is consistent with the current coord behavior. BooleanQuery.netCoord could be overridden for special cases, like the pure disjunction I use in my app for field expansions:

> > >   MaxDisjunctionQuery(c1 ... cn).netCoord = max(ci) ci.netCoord

> > > Implementing this would proceed along these lines:
> > >   1. For backwards compatibility, add some kind of newScoring boolean setting.
> > >   2. Update all of these places to behave as indicated if newScoring:
> > >      a. Change Query.weight() to not do any normalization (no call to sumOfSquaredWeights(), Similarity.queryNorm() or normalize()).
> > >      b. Update all Query.weight classes to set their value according to the terms in the score formula above that don't involve the document (e.g., boost*idf for TermQuery).
> > >      c. Add suitable netCoord definitions to all Scorer classes (a sketch follows below).
> > >      d. Update all Scorer.score() methods to compute the rawScore as above.
> > >      e. Add the maximum baseScore as a field kept on TopDocs, computed in the HitCollectors.
> > >      f. Change the normalization in Hits to always divide every raw score by the maximum baseScore.
> > >      g. Update all of the current explain() methods to be consistent with this scoring, and to either report the rawScore, or to report the final score if the normalization factor is provided.
> > >      h. Add Hits.explain() (or better, extend Searcher so that it keeps the Hits for use in Searcher.explain()) to call the new explain variation with the normalization factor, so that final scores are fully explained.

> > > If this seems like a good idea, please let me know. I'm sure there are details I've missed that would come out during coding and testing. Also, the value of this depends on how reasonable the final scores look, which is hard to tell for sure until it is working.

> > > The coding standards for Lucene seem reasonably clear from the source code I've read. I could use just simple Java, so there shouldn't be any significant JVM dependencies. The above should be fully backward compatible due to the newScoring flag.
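
> > > For item c, roughly what I have in mind for the default netCoord definitions (a sketch only -- netCoord() is not an existing method, and subScorers/matchedCurrentDoc are hand-waved; coord(1, n) is the existing Similarity.coord()):

> > >   // in TermScorer: 1.0 if this term matched the current doc, else 0.0
> > >   public float netCoord() {
> > >     return matchedCurrentDoc ? 1.0f : 0.0f;
> > >   }

> > >   // in the BooleanQuery scorer: each clause weighted equally via coord(1, n)
> > >   public float netCoord() {
> > >     float sum = 0.0f;
> > >     for (int i = 0; i < subScorers.length; i++)
> > >       sum += getSimilarity().coord(1, subScorers.length) * subScorers[i].netCoord();
> > >     return sum;
> > >   }

> > >   // in a MaxDisjunctionQuery scorer: a pure disjunction takes the best clause
> > >   public float netCoord() {
> > >     float max = 0.0f;
> > >     for (int i = 0; i < subScorers.length; i++)
> > >       max = Math.max(max, subScorers[i].netCoord());
> > >     return max;
> > >   }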

> > > On another note, I had to remove the German analyzer from my current 1.4.2 source configuration because GermanStemmer failed to compile, due to what are apparently Unicode character constants that I've now got as illegal two-character character constants. This is presumably an encoding problem somewhere that I could track down. It's not important, but if the answer is obvious to any of you, I'd appreciate the quick tip.

> > > Thanks,

> > > Chuck

> > > > -----Original Message-----
> > > > From: Doug Cutting [mailto:[EMAIL PROTECTED]
> > > > Sent: Monday, October 18, 2004 9:44 AM
> > > > To: Lucene Developers List
> > > > Subject: Re: idf and explain(), was Re: Search and Scoring

> > > > Chuck Williams wrote:
> > > > > That's a good point on how the standard vector space inner product similarity measure does imply that the idf is squared relative to the document tf. Even having been aware of this formula for a long time, this particular implication never occurred to me. Do you know if anybody has done precision/recall or other relevancy empirical measurements comparing this vs. a model that does not square idf?

> > > > No, not that I know of.

> > > > > Regarding normalization, the normalization in Hits does not have very nice properties. Due to the > 1.0 threshold check, it loses information, and it arbitrarily defines the highest scoring result in any list that generates scores above 1.0 as a perfect match. It would be nice if score values were meaningful independent of searches, e.g., if "0.6" meant the same quality of retrieval independent of what search was done. This would allow, for example, sites to use a simple quality threshold to only show results that were "good enough". At my last company (I was President and head of engineering for InQuira), we found this to be important to many customers.

> > > > If this is a big issue for you, as it seems it is, please submit a patch to optionally disable score normalization in Hits.java.

> > > > > The standard vector space similarity measure includes normalization by the product of the norms of the vectors, i.e.:

> > > > >   score(d,q) = [ sum(t) weight(t,q) * weight(t,d) ] / sqrt( [sum(t) weight(t,q)^2] * [sum(t) weight(t,d)^2] )

> > > > > This makes the score a cosine, which, since the values are all positive, forces it to be in [0, 1]. The sumOfSquares() normalization in Lucene does not fully implement this. Is there a specific reason for that?

> > > > The quantity 'sum(t) weight(t,d)^2' must be recomputed for each document each time a document is added to the collection, since 'weight(t,d)' is dependent on global term statistics. This is prohibitively expensive. Research has also demonstrated that such cosine normalization gives somewhat inferior results (e.g., Singhal's pivoted length normalization).

> > > > > Re. explain(), I don't see a downside to extending it to show the final normalization in Hits. It could still show the raw score just prior to that normalization.

> > > > In order to normalize scores to 1.0 one must know the maximum score. Explain only computes the score for a single document, and the maximum score is not known.

> > > > > Although I think it would be best to have a normalization that would render scores comparable across searches.

> > > > Then please submit a patch. Lucene doesn't change on its own.

> > > > Doug