Chuck - if you provide patches and post them to Bugzilla, one of the developers will try them locally and either provide feedback or merge your changes into CVS.
Otis

--- Chuck Williams <[EMAIL PROTECTED]> wrote:

> Thanks Otis. Other than trying to get some consensus a) that this is a problem worth fixing, and b) on the best approach to fix it, my central question is: if I fix it, is it likely to get incorporated back into Lucene? I don't want to deviate from the Lucene sources, especially across so many classes, so I would like to address this only if there is a process to evaluate the changes and incorporate them back into Lucene if they provide the improvement I believe they will.
>
> Thanks for any guidance on that,
>
> Chuck
>
> > -----Original Message-----
> > From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
> > Sent: Thursday, October 21, 2004 10:06 AM
> > To: Lucene Developers List
> > Subject: Re: Normalized Scoring -- was RE: idf and explain(), was Re: Search and Scoring
> >
> > Hi Chuck,
> >
> > The relative lack of responses is not because there is no interest, but probably because there are only a few people on lucene-dev who can follow/understand every detail of your proposal. I understand and hear you, but I have a hard time 'visualizing' some of the formulas in your proposal. What you are saying sounds right to me, but I don't have enough theoretical knowledge to go one way or the other.
> >
> > Otis
> >
> > --- Chuck Williams <[EMAIL PROTECTED]> wrote:
> > >
> > > Hi everybody,
> > >
> > > Although there doesn't seem to be much interest in this, I have one significant improvement to the proposal below and a couple of observations that clarify the situation.
> > >
> > > To illustrate the problem that better normalization is intended to address: in my current application, for BooleanQuery's of two terms I frequently get a top score of 1.0 when only one of the terms is matched. 1.0 should indicate a "perfect match". I'd like to set my UI up to present the results differently depending on how good they are (e.g., showing a visual indication of result quality, dropping the really bad results entirely, and segregating the good results from other, only vaguely relevant results). To build this kind of "intelligence" into the UI, I certainly need to know whether my top result matched all query terms or only half of them. I'd like the score to tell me directly how good the matches are. The current normalization does not achieve this.
> > >
> > > The intent of a new normalization scheme is to preserve the current scoring in the following sense: the ratio of the nth result's score to the best result's score remains the same. Therefore, the only question is what normalization factor to apply to all scores. This reduces to a very specific question that determines the entire normalization -- what should the score of the best matching result be?
> > >
> > > The mechanism below has this property, i.e. it keeps the current score ratios, except that I removed one idf term for reasons covered earlier (this better reflects the current empirically best scoring algorithms). However, removing an idf is a completely separate issue. The improved normalization is independent of whether or not that change is made.
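> > >
> > > (To make the ratio preservation concrete, here is a minimal sketch of such a renormalization -- hypothetical helper code, not the actual Hits implementation:)
> > >
> > >     // Sketch: rescale raw scores so the best hit gets 'topScore'
> > >     // (e.g., the fraction of query terms it matched) while the
> > >     // ratio of every hit's score to the best hit's is preserved.
> > >     static float[] renormalize(float[] rawScores, float topScore) {
> > >         float max = 0.0f;
> > >         for (float s : rawScores) max = Math.max(max, s);
> > >         float norm = (max > 0.0f) ? topScore / max : 0.0f;
> > >         float[] scores = new float[rawScores.length];
> > >         for (int i = 0; i < rawScores.length; i++)
> > >             scores[i] = rawScores[i] * norm;
> > >         return scores;
> > >     }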
> > > For the central question of what the top score should be, upon reflection, I don't like the definition below. It defined the top score as (approximately) the percentage of query terms matched by the top scoring result. A better conceptual definition is to use a weighted average based on the boosts, i.e., propagate all boosts downward to the underlying terms (or phrases). Specifically, the "net boost" of a term is the product of the direct boost of the term and all boosts applied to encompassing clauses. Then the score of the top result becomes the sum of the net boosts of its matching terms divided by the sum of the net boosts of all query terms.
> > >
> > > This definition is a refinement of the original proposal below, and it could probably be further refined if some impact of the tf, idf and/or lengthNorm were desired in determining the top score. These other factors seem to be harder to normalize for, although I've thought of some simple approaches; e.g., assume the unmatched terms in the top result have values for these three factors that are the average of the values of the matched terms, then apply exactly the same concept of actual score divided by theoretical maximum score. That would eliminate any need to maintain maximum-value statistics in the index.
> > >
> > > As an example of the simple boost-based normalization, for the query ((a^2 b)^3 (c d^2)) the net boosts are:
> > >
> > >     a --> 6
> > >     b --> 3
> > >     c --> 1
> > >     d --> 2
> > >
> > > So if a and b matched in the top scoring result, but not c and d, its score would be (6+3)/(6+3+1+2) = 0.75. The normalizer would be 0.75 / (current score except for the current normalization). This normalizer would be applied to all current scores (minus normalization) to create the normalized scores.
> > >
> > > For the simple query (a b), if only one of the terms matched in the top result, then its score would be 0.5, vs. 1.0 or many other possible scores today.
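> > >
> > > (A minimal sketch of the downward boost propagation -- the little Node/Term/Clause classes are hypothetical stand-ins, not the Lucene Query API:)
> > >
> > >     import java.util.Map;
> > >
> > >     // A query node is either a term or a clause list; every node has a boost.
> > >     abstract class Node { float boost = 1.0f; }
> > >     class Term extends Node { String name; Term(String n, float b) { name = n; boost = b; } }
> > >     class Clause extends Node { Node[] children; Clause(float b, Node... c) { boost = b; children = c; } }
> > >
> > >     class NetBoosts {
> > >         // Net boost of a term = its own boost times the boosts of all enclosing clauses.
> > >         static void collect(Node n, float enclosing, Map<String, Float> out) {
> > >             float b = enclosing * n.boost;
> > >             if (n instanceof Term) out.put(((Term) n).name, b);
> > >             else for (Node c : ((Clause) n).children) collect(c, b, out);
> > >         }
> > >     }
> > >
> > > Running this on ((a^2 b)^3 (c d^2)) -- i.e., new Clause(1, new Clause(3, new Term("a", 2), new Term("b", 1)), new Clause(1, new Term("c", 1), new Term("d", 2))) -- yields exactly the net boosts above: a=6, b=3, c=1, d=2.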
> > > In addition to enabling more "intelligent" UIs that communicate the quality of results to end users, the proposal below also extends the explain() mechanism to fully explain the final normalized score. However, that change is also independent -- it could be done with the current scoring.
> > >
> > > Am I the only one who would like to see better normalization in Lucene? Does anybody have a better approach?
> > >
> > > If you've read this far, thanks for indulging me on this. I would love to see this or something with similar properties in Lucene. I really just want to build my app, but as stated below I would write and contribute this if there is interest in putting it in and nobody else wants to write it. Please let me know what you think one way or the other.
> > >
> > > Thanks,
> > >
> > > Chuck
> > >
> > > > -----Original Message-----
> > > > From: Chuck Williams
> > > > Sent: Monday, October 18, 2004 7:04 PM
> > > > To: 'Lucene Developers List'
> > > > Subject: RE: idf and explain(), was Re: Search and Scoring
> > > >
> > > > Doug Cutting wrote:
> > > > > If this is a big issue for you, as it seems it is, please submit a patch to optionally disable score normalization in Hits.java.
> > > >
> > > > and:
> > > > > The quantity 'sum(t) weight(t,d)^2' must be recomputed for each document each time a document is added to the collection, since 'weight(t,d)' is dependent on global term statistics. This is prohibitively expensive. Research has also demonstrated that such cosine normalization gives somewhat inferior results (e.g., Singhal's pivoted length normalization).
> > > >
> > > > I'm willing to write, test and contribute code to address the normalization issue, i.e. to yield scores in [0, 1] that are meaningful across searches. Unfortunately, this is considerably more involved than just optionally eliminating the current normalization in Hits. Before undertaking this, I'd like to see if there is agreement in principle that this is a good idea, and that my specific proposal below is the right way to go. Also, I'd like to make sure I've correctly inferred the constraints on writing code to be incorporated into Lucene.
> > > >
> > > > After looking at this in more detail, I agree that the cosine normalization is not the way to go, for both efficiency and effectiveness reasons. A conceptual approach that would be efficient, relatively easy to implement, and seems to have at least reasonable behavior would be to define the top scoring match to have a score roughly equal to the percentage of query terms it matches (its "netCoord"). Scores below the top hit would be reduced based on the ratio of their raw scores to the raw score of the top hit (considering all of the current score factors, except that I'd like to remove one of the idf factors, as discussed earlier).
> > > >
> > > > For a couple of simple cases:
> > > > a) the top match for a single-term query would always have a score of 1.0;
> > > > b) the top scoring match for a BooleanQuery using DefaultSimilarity with all non-prohibited TermQuery clauses would have a score of m/n, where the hit matches m of the n terms.
> > > >
> > > > This isn't optimal, but it seems much better than the current situation. Consider two single-term queries, s and t. If s matches more strongly than t in its top hit (e.g., a higher tf in a shorter field), it would be best if the top score of s were greater than the top score of t. But this kind of normalization would require keeping additional statistics that, so far as I know, are not currently in the index, like the maximum tf for terms and the minimum length for fields. These could be expensive to update on deletes. Also, normalizing by such factors could yield lower than subjectively reasonable scores in most cases, so it's not clear it would be better.
> > > >
> > > > The semantics above are at least clean and easy to understand, and they support what seems to me the most important motivation for doing this: allowing an application to use simple thresholding to segregate likely-to-be-relevant hits from likely-to-be-irrelevant hits, e.g. as in the sketch below.
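> > > >
> > > > (A minimal sketch of that thresholding, assuming the normalized scores proposed here; Hits.length() and Hits.score(i) are the existing API, and the 0.5 cutoff is just an illustrative choice:)
> > > >
> > > >     // With scores normalized so 1.0 means "matched all query terms",
> > > >     // a fixed cutoff cleanly separates strong matches from weak ones.
> > > >     static int countLikelyRelevant(org.apache.lucene.search.Hits hits, float cutoff)
> > > >             throws java.io.IOException {
> > > >         int n = 0;
> > > >         for (int i = 0; i < hits.length(); i++)
> > > >             if (hits.score(i) >= cutoff) n++;  // e.g., cutoff = 0.5 keeps hits matching at least half the terms
> > > >         return n;
> > > >     }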
> > > > More specifically, for a BooleanQuery of TermQuery's the resulting score functions would be:
> > > >
> > > >     BooleanQuery of TermQuery's: sbq = (tq1 ... tqn)
> > > >
> > > >     baseScore(sbq, doc) = sum(tqi) boost(tqi) * idf(tqi.term) * tf(tqi.term, doc) * lengthNorm(tqi.term.field, doc)
> > > >
> > > >     rawScore(sbq, doc) = coord(sbq, doc) * baseScore(sbq, doc)
> > > >
> > > >     norm(sbq, hits) = 1 / max(hit in hits) baseScore(sbq, hit)
> > > >
> > > >     score(sbq, doc) = rawScore(sbq, doc) * norm(sbq, hits)
> > > >
> > > > rawScores can be computed in the Scorer.score() methods and therefore used to sort the hits (e.g., in the instance method for collect() in the HitCollector in IndexSearcher.search()). If the top scoring hit does not have the highest baseScore, then its score could be less than its coord; this seems desirable. These formulas imply that no result's score can be larger than its coord, so if coord is well-defined (always between 0 and 1) then all results will be normalized between 0 and 1.
> > > >
> > > > In general, the netCoord, which takes the place of coord in the simple case above, needs to be defined for any query. Conceptually, this should be the total percentage of query terms matched by the document. It must be recursively computable from the subquery structure and overridable for specific Query types (e.g., to support specialized coords, like one that is always 1.0, as is useful in multi-field searching). Suitable default definitions for TermQuery and BooleanQuery are:
> > > >
> > > >     TermQuery.netCoord = 1.0 if the term matches, 0.0 otherwise
> > > >
> > > >     BooleanQuery(c1 ... cn).netCoord = sum(ci) coord(1, n) * ci.netCoord
> > > >
> > > > This is not quite the percentage of terms matched; e.g., consider a BooleanQuery with two clauses, one of which is a BooleanQuery of 3 terms and the other just a term. However, it doesn't seem unreasonable for a BooleanQuery to state that its clauses are equally important, and this is consistent with the current coord behavior. BooleanQuery.netCoord could be overridden for special cases, like the pure disjunction I use in my app for field expansions:
> > > >
> > > >     MaxDisjunctionQuery(c1 ... cn).netCoord = max(ci) ci.netCoord
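> > > >
> > > > (A sketch of the netCoord recursion, reusing the hypothetical Node/Term/Clause classes from the earlier sketch -- not the real Scorer API:)
> > > >
> > > >     import java.util.Set;
> > > >
> > > >     class NetCoord {
> > > >         // Fraction of the query matched, computed bottom-up. A term contributes
> > > >         // 1.0 if it matched; a BooleanQuery-style clause averages its children,
> > > >         // since coord(1, n) = 1/n; a pure disjunction would take the max instead.
> > > >         static float netCoord(Node n, Set<String> matched) {
> > > >             if (n instanceof Term) return matched.contains(((Term) n).name) ? 1.0f : 0.0f;
> > > >             Node[] kids = ((Clause) n).children;
> > > >             float sum = 0.0f;
> > > >             for (Node c : kids) sum += netCoord(c, matched);
> > > >             return sum / kids.length;
> > > >         }
> > > >     }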
> > > > Implementing this would proceed along these lines:
> > > >
> > > > 1. For backwards compatibility, add some kind of newScoring boolean setting.
> > > > 2. Update all of these places to behave as indicated if newScoring:
> > > >    a. Change Query.weight() to not do any normalization (no call to sumOfSquaredWeights(), Similarity.queryNorm() or normalize()).
> > > >    b. Update all Query.weight classes to set their value according to the terms in the score formula above that don't involve the document (e.g., boost*idf for TermQuery).
> > > >    c. Add suitable netCoord definitions to all Scorer classes.
> > > >    d. Update all Scorer.score() methods to compute the rawScore as above.
> > > >    e. Add the maximum baseScore as a field kept on TopDocs, computed in the HitCollectors.
> > > >    f. Change the normalization in Hits to always divide every raw score by the maximum baseScore.
> > > >    g. Update all of the current explain() methods to be consistent with this scoring, and to either report the rawScore, or to report the final score if the normalization factor is provided.
> > > >    h. Add Hits.explain() (or, better, extend Searcher so that it keeps the Hits for use in Searcher.explain()) to call the new explain variation with the normalization factor so that final scores are fully explained.
> > > >
> > > > If this seems like a good idea, please let me know. I'm sure there are details I've missed that would come out during coding and testing. Also, the value of this depends on how reasonable the final scores look, which is hard to tell for sure until it is working.
> > > >
> > > > The coding standards for Lucene seem reasonably clear from the source code I've read. I would use just simple Java, so there shouldn't be any significant JVM dependencies. The above should be fully backward compatible due to the newScoring flag.
> > > >
> > > > On another note, I had to remove the German analyzer in my current 1.4.2 source configuration because GermanStemmer failed to compile due to what are apparently Unicode character constants that I've now got as illegal two-character character constants. This is presumably an encoding problem somewhere that I could track down. It's not important, but if the answer is obvious to any of you, I'd appreciate the quick tip.
> > > >
> > > > Thanks,
> > > >
> > > > Chuck
> > > >
> > > > > -----Original Message-----
> > > > > From: Doug Cutting [mailto:[EMAIL PROTECTED]]
> > > > > Sent: Monday, October 18, 2004 9:44 AM
> > > > > To: Lucene Developers List
> > > > > Subject: Re: idf and explain(), was Re: Search and Scoring
> > > > >
> > > > > Chuck Williams wrote:
> > > > > > That's a good point on how the standard vector space inner product similarity measure does imply that the idf is squared relative to the document tf. Even having been aware of this formula for a long time, this particular implication never occurred to me. Do you know if anybody has done precision/recall or other empirical relevancy measurements comparing this vs. a model that does not square idf?
> > > > >
> > > > > No, not that I know of.
> > > > >
> > > > > > Regarding normalization, the normalization in Hits does not have very nice properties. Due to the > 1.0 threshold check, it loses information, and it arbitrarily defines the highest scoring result in any list that generates scores above 1.0 as a perfect match. It would be nice if score values were meaningful independent of searches, e.g., if "0.6" meant the same quality of retrieval independent of what search was done. This would allow, for example, sites to use a simple quality threshold to only show results that were "good enough". At my last company (I was President and head of engineering for InQuira), we found this to be important to many customers.
> > > > >
> > > > > If this is a big issue for you, as it seems it is, please submit a patch to optionally disable score normalization in Hits.java.
> > > > >
> > > > > > The standard vector space similarity measure includes normalization by the product of the norms of the vectors, i.e.:
> > > > > >
> > > > > >     score(d,q) = sum over t of ( weight(t,q) * weight(t,d) ) / sqrt[ (sum(t) weight(t,q)^2) * (sum(t) weight(t,d)^2) ]
> > > > > >
> > > > > > This makes the score a cosine, which, since the values are all positive, forces it to be in [0, 1]. The sumOfSquares() normalization in Lucene does not fully implement this. Is there a specific reason for that?
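> > > > > >
> > > > > > (For reference, a minimal sketch of this cosine measure over sparse term-weight vectors -- illustration only, not Lucene code:)
> > > > > >
> > > > > >     import java.util.Map;
> > > > > >
> > > > > >     class Cosine {
> > > > > >         // The document norm (sum of squared document weights) is the part
> > > > > >         // that would have to be recomputed as global term statistics change.
> > > > > >         static double score(Map<String, Double> q, Map<String, Double> d) {
> > > > > >             double dot = 0, qNorm = 0, dNorm = 0;
> > > > > >             for (Map.Entry<String, Double> e : q.entrySet()) {
> > > > > >                 qNorm += e.getValue() * e.getValue();
> > > > > >                 Double w = d.get(e.getKey());
> > > > > >                 if (w != null) dot += e.getValue() * w;
> > > > > >             }
> > > > > >             for (double w : d.values()) dNorm += w * w;
> > > > > >             return dot == 0 ? 0 : dot / Math.sqrt(qNorm * dNorm);
> > > > > >         }
> > > > > >     }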
> > > > At my > > > > > > last company (I was President and head of engineering > for > > > InQuira), > > > > we > > > > > > found this to be important to many customers. > > > > > > > > > > If this is a big issue for you, as it seems it is, > please > > > submit > > > a > > > > patch > > > > > to optionally disable score normalization in Hits.java. > > > > > > > > > > > The standard vector space similarity measure includes > > > > normalization by > > > > > > the product of the norms of the vectors, i.e.: > > > > > > > > > > > > score(d,q) = sum over t of ( weight(t,q) * > weight(t,d) ) > > > / > > > > > > sqrt [ (sum(t) weight(t,q)^2) * > (sum(t) > > > > > weight(t,d)^2) ] > > > > > > > > > > > > This makes the score a cosine, which since the values > are > > > all > > > > positive, > > > > > > forces it to be in [0, 1]. The sumOfSquares() > > > normalization > > > in > > > > Lucene > > > > > > does not fully implement this. Is there a specific > reason > > > for > > > > that? > > > > > > > > > > The quantity 'sum(t) weight(t,d)^2' must be recomputed > for > > > each > > > > document > > > > > each time a document is added to the collection, since > > > 'weight(t,d)' > > > > is > > > > > dependent on global term statistics. This is > prohibitivly > > > expensive. > > > > > Research has also demonstrated that such cosine > normalization > > > gives > > > > > somewhat inferior results (e.g., Singhal's pivoted > length > > > > normalization). > > > > > > > > > > > Re. explain(), I don't see a downside to extending it > show > > > the > > > > final > > > > > > normalization in Hits. It could still show the raw > score > > > just > > > > prior > > > > > to > > > > > > that normalization. > > > > > > > > > > In order to normalize scores to 1.0 one must know the > maximum > > > score. > > > > > Explain only computes the score for a single document, > and > > > the > > > > maximum > > > > > score is not known. > > > > > > > > > > > Although I think it would be best to have a > > > > > > normalization that would render scores comparable > across > > > searches. > > > > > > > > > > Then please submit a patch. Lucene doesn't change on > its > > > own. > > > > > > > > > > Doug > > > > > > > > > > > > > > > > > > > > > > > > -------------------------------------------------------------------- > > > > - > > > > > To unsubscribe, e-mail: > > > [EMAIL PROTECTED] > > > > > For additional commands, e-mail: > > > [EMAIL PROTECTED] > > > > > > > > > > --------------------------------------------------------------------- > > > To unsubscribe, e-mail: > [EMAIL PROTECTED] > > > For additional commands, e-mail: > [EMAIL PROTECTED] > > > > > > > > > > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: > [EMAIL PROTECTED] > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]