Hi Chuck,

The relative lack of responses is not because there is no interest, but probably because only a few people on lucene-dev can follow and understand every detail of your proposal. I understand and hear you, but I have a hard time 'visualizing' some of the formulas in your proposal. What you are saying sounds right to me, but I don't have enough theoretical knowledge to argue one way or the other.
Otis

--- Chuck Williams <[EMAIL PROTECTED]> wrote:

> Hi everybody,
>
> Although there doesn't seem to be much interest in this, I have one significant improvement to the proposal below and a couple of observations that clarify the situation.
>
> To illustrate the problem better normalization is intended to address: in my current application, for BooleanQuery's of two terms I frequently get a top score of 1.0 when only one of the terms is matched. 1.0 should indicate a "perfect match". I'd like to set my UI up to present the results differently depending on how good they are (e.g., showing a visual indication of result quality, dropping the really bad results entirely, and segregating the good results from other only vaguely relevant results). To build this kind of "intelligence" into the UI, I certainly need to know whether my top result matched all query terms, or only half of them. I'd like to have the score tell me directly how good the matches are. The current normalization does not achieve this.
>
> The intent of a new normalization scheme is to preserve the current scoring in the following sense: the ratio of the nth result's score to the best result's score remains the same. Therefore, the only question is what normalization factor to apply to all scores. This reduces to a very specific question that determines the entire normalization -- what should the score of the best matching result be?
>
> The mechanism below has this property, i.e. it keeps the current score ratios, except that I removed one idf term for reasons covered earlier (this better reflects the current empirically best scoring algorithms). However, removing an idf is a completely separate issue. The improved normalization is independent of whether or not that change is made.
>
> For the central question of what the top score should be, upon reflection, I don't like the definition below.
> It defined the top score as (approximately) the percentage of query terms matched by the top scoring result. A better conceptual definition is to use a weighted average based on the boosts, i.e., propagate all boosts downward to the underlying terms (or phrases). Specifically, the "net boost" of a term is the product of the direct boost of the term and all boosts applied to encompassing clauses. Then the score of the top result becomes the sum of the net boosts of its matching terms divided by the sum of the net boosts of all query terms.
>
> This definition is a refinement of the original proposal below, and it could probably be further refined if some impact of the tf, idf and/or lengthNorm were desired in determining the top score. These other factors seem to be harder to normalize for, although I've thought of some simple approaches; e.g., assume the unmatched terms in the top result have values for these three factors that are the average of the values of the matched terms, then apply exactly the same concept of actual score divided by theoretical maximum score. That would eliminate any need to maintain maximum value statistics in the index.
>
> As an example of the simple boost-based normalization, for the query
>
>   ((a^2 b)^3 (c d^2))
>
> the net boosts are:
>
>   a --> 6
>   b --> 3
>   c --> 1
>   d --> 2
>
> So if a and b matched, but not c and d, in the top scoring result, its score would be 0.75. The normalizer would be 0.75 / (current score except for the current normalization). This normalizer would be applied to all current scores (minus normalization) to create the normalized scores.
>
> For the simple query (a b), if only one of the terms matched in the top result, then its score would be 0.5, vs. 1.0 or many other possible scores today.
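The boost-propagation arithmetic above is easy to sketch. The following is a hypothetical illustration (plain Python rather than Lucene's Java, with a made-up tuple representation of the query tree), not actual Lucene code:

```python
# Hypothetical sketch of the boost-based top score and normalizer
# described above; not Lucene code. A query node is (boost, "term") for
# a leaf or (boost, [children]) for a clause; a term's net boost is the
# product of all boosts on the path down to it.

def net_boosts(node, inherited=1.0):
    boost, body = node
    if isinstance(body, str):                  # leaf term
        return {body: inherited * boost}
    result = {}
    for child in body:                         # clause: propagate product
        result.update(net_boosts(child, inherited * boost))
    return result

# ((a^2 b)^3 (c d^2)) from the example above
query = (1.0, [(3.0, [(2.0, "a"), (1.0, "b")]),
               (1.0, [(1.0, "c"), (2.0, "d")])])
boosts = net_boosts(query)       # {'a': 6.0, 'b': 3.0, 'c': 1.0, 'd': 2.0}

def top_score(boosts, matched_terms):
    """Sum of matched net boosts over the sum of all net boosts."""
    return sum(boosts[t] for t in matched_terms) / sum(boosts.values())

top = top_score(boosts, {"a", "b"})            # 9/12 = 0.75

def normalize(raw_scores, top):
    """Scale raw scores so the best equals top; all ratios are preserved."""
    factor = top / max(raw_scores)
    return [s * factor for s in raw_scores]
```

Dividing every raw score by the top hit's raw score and multiplying by the boost-based top score is exactly the ratio-preserving normalization described in the message.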
> In addition to enabling more "intelligent" UIs that communicate the quality of results to end-users, the proposal below also extends the explain() mechanism to fully explain the final normalized score. However, that change is also independent -- it could be done with the current scoring.
>
> Am I the only one who would like to see better normalization in Lucene? Does anybody have a better approach?
>
> If you've read this far, thanks for indulging me on this. I would love to see this or something with similar properties in Lucene. I really just want to build my app, but as stated below I would write and contribute this if there is interest in putting it in and nobody else wants to write it. Please let me know what you think one way or the other.
>
> Thanks,
>
> Chuck
>
> > -----Original Message-----
> > From: Chuck Williams
> > Sent: Monday, October 18, 2004 7:04 PM
> > To: 'Lucene Developers List'
> > Subject: RE: idf and explain(), was Re: Search and Scoring
> >
> > Doug Cutting wrote:
> > > If this is a big issue for you, as it seems it is, please submit a patch to optionally disable score normalization in Hits.java.
> > and:
> > > The quantity 'sum(t) weight(t,d)^2' must be recomputed for each document each time a document is added to the collection, since 'weight(t,d)' is dependent on global term statistics. This is prohibitively expensive. Research has also demonstrated that such cosine normalization gives somewhat inferior results (e.g., Singhal's pivoted length normalization).
> >
> > I'm willing to write, test and contribute code to address the normalization issue, i.e. to yield scores in [0, 1] that are meaningful across searches. Unfortunately, this is considerably more involved than just optionally eliminating the current normalization in Hits.
> > Before undertaking this, I'd like to see if there is agreement in principle that this is a good idea, and that my specific proposal below is the right way to go. Also, I'd like to make sure I've correctly inferred the constraints on writing code to be incorporated into Lucene.
> >
> > After looking at this in more detail, I agree that the cosine normalization is not the way to go, because of both efficiency and effectiveness considerations. A conceptual approach that would be efficient, relatively easy to implement, and seems to have at least reasonable behavior would be to define the top scoring match to have a score roughly equal to the percentage of query terms it matches (its "netCoord"). Scores below the top hit would be reduced based on the ratio of their raw scores to the raw score of the top hit (considering all of the current score factors, except that I'd like to remove one of the idf factors, as discussed earlier).
> >
> > For a couple of simple cases:
> >
> >   a) the top match for a single-term query would always have a score of 1.0;
> >   b) the top scoring match for a BooleanQuery using DefaultSimilarity with all non-prohibited TermQuery clauses would have a score of m/n, where the hit matches m of the n terms.
> >
> > This isn't optimal, but seems much better than the current situation. Consider two single-term queries, s and t. If s matches more strongly than t in its top hit (e.g., a higher tf in a shorter field), it would be best if the top score of s were greater than the top score of t. But this kind of normalization would require keeping additional statistics that, so far as I know, are not currently in the index, like the maximum tf for terms and the minimum length for fields. These could be expensive to update on deletes.
> > Also, normalizing by such factors could yield lower than subjectively reasonable scores in most cases, so it's not clear it would be better.
> >
> > The semantics above are at least clean, easy to understand, and support what seems to me the most important motivation for doing this: allowing an application to use simple thresholding to segregate likely-to-be-relevant hits from likely-to-be-irrelevant hits.
> >
> > More specifically, for a BooleanQuery of TermQuery's, the resulting score functions would be:
> >
> >   BooleanQuery of TermQuery's: sbq = (tq1 ... tqn)
> >
> >   baseScore(sbq, doc) = sum(tqi) boost(tqi) * idf(tqi.term) * tf(tqi.term, doc) * lengthNorm(tqi.term.field, doc)
> >
> >   rawScore(sbq, doc) = coord(sbq, doc) * baseScore(sbq, doc)
> >
> >   norm(sbq, hits) = 1 / max(hit in hits) baseScore(sbq, hit)
> >
> >   score(sbq, doc) = rawScore(sbq, doc) * norm(sbq, hits)
> >
> > rawScores can be computed in the Scorer.score() methods and therefore used to sort the hits (e.g., in the instance method for collect() in the HitCollector in IndexSearcher.search()). If the top scoring hit does not have the highest baseScore, then its score could be less than its coord; this seems desirable. These formulas imply that no result score can be larger than its coord, so if coord is well-defined (always between 0 and 1) then all results will be normalized between 0 and 1.
> >
> > In general, the netCoord, which takes the place of coord in the simple case above, needs to be defined for any query. Conceptually, this should be the total percentage of query terms matched by the document. It must be recursively computable from the subquery structure and overridable for specific Query types (e.g., to support specialized coords, like one that is always 1.0, as is useful in multi-field searching).
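A toy sketch of these score functions together with a recursive netCoord may make the proposal more concrete. This is plain Python rather than Lucene's Java, and everything in it is hypothetical: per-term weights stand in for boost * idf * tf * lengthNorm, and a plain list stands in for a BooleanQuery that weights each of its n clauses equally by 1/n:

```python
# Hypothetical sketch of baseScore / rawScore / norm / score; not Lucene
# code. Each document is modeled as the set of query terms it matches.

def base_score(weights, doc_terms):
    """Sum of the weights of the query terms the document matches."""
    return sum(w for t, w in weights.items() if t in doc_terms)

def net_coord(node, doc_terms):
    """Leaves (term strings) score 1.0 if matched, else 0.0; a list
    (BooleanQuery) averages its clauses' netCoords."""
    if isinstance(node, str):
        return 1.0 if node in doc_terms else 0.0
    return sum(net_coord(c, doc_terms) for c in node) / len(node)

def scores(query, weights, docs):
    """rawScore = netCoord * baseScore; normalize by the max baseScore."""
    raws = [net_coord(query, d) * base_score(weights, d) for d in docs]
    norm = 1.0 / max(base_score(weights, d) for d in docs)
    return [r * norm for r in raws]

query = ["a", "b"]               # BooleanQuery of two TermQuery's
weights = {"a": 2.0, "b": 1.0}
docs = [{"a", "b"}, {"a"}]
scores(query, weights, docs)     # the hit matching both terms scores 1.0
```

A MaxDisjunctionQuery-style override would simply replace the averaging line with max() over the children.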
> > Suitable default definitions for TermQuery and BooleanQuery are:
> >
> >   TermQuery.netCoord = 1.0 if the term matches, 0.0 otherwise
> >
> >   BooleanQuery(c1 ... cn).netCoord = sum(ci) coord(1, n) * ci.netCoord
> >
> > This is not quite the percentage of terms matched; e.g., consider a BooleanQuery with two clauses, one of which is a BooleanQuery of 3 terms and the other of which is just a term. However, it doesn't seem unreasonable for a BooleanQuery to state that its clauses are equally important, and this is consistent with the current coord behavior. BooleanQuery.netCoord could be overridden for special cases, like the pure disjunction I use in my app for field expansions:
> >
> >   MaxDisjunctionQuery(c1 ... cn).netCoord = max(ci) ci.netCoord
> >
> > Implementing this would proceed along these lines:
> >
> >   1. For backwards compatibility, add some kind of newScoring boolean setting.
> >   2. Update all of these places to behave as indicated if newScoring:
> >      a. Change Query.weight() to not do any normalization (no call to sumOfSquaredWeights(), Similarity.queryNorm() or normalize()).
> >      b. Update all Query.weight classes to set their value according to the terms in the score formula above that don't involve the document (e.g., boost*idf for TermQuery).
> >      c. Add suitable netCoord definitions to all Scorer classes.
> >      d. Update all Scorer.score() methods to compute the rawScore as above.
> >      e. Add the maximum baseScore as a field kept on TopDocs, computed in the HitCollector's.
> >      f. Change the normalization in Hits to always divide every raw score by the maximum baseScore.
> >      g. Update all of the current explain() methods to be consistent with this scoring, and to either report the rawScore, or to report the final score if the normalization factor is provided.
> >      h. Add Hits.explain() (or, better, extend Searcher so that it keeps the Hits for use in Searcher.explain()) to call the new explain variation with the normalization factor so that final scores are fully explained.
> >
> > If this seems like a good idea, please let me know. I'm sure there are details I've missed that would come out during coding and testing. Also, the value of this is dependent on how reasonable the final scores look, which is hard to tell for sure until it is working.
> >
> > The coding standards for Lucene seem reasonably clear from the source code I've read. I could use just simple Java, so there shouldn't be any significant JVM dependencies. The above should be fully backward compatible due to the newScoring flag.
> >
> > On another note, I had to remove the German analyzer in my current 1.4.2 source configuration because GermanStemmer failed to compile due to what are apparently Unicode character constants that I've now got as illegal two-character character constants. This is presumably an encoding problem somewhere that I could track down. It's not important, but if the answer is obvious to any of you, I'd appreciate the quick tip.
> >
> > Thanks,
> >
> > Chuck
> >
> > > -----Original Message-----
> > > From: Doug Cutting [mailto:[EMAIL PROTECTED]]
> > > Sent: Monday, October 18, 2004 9:44 AM
> > > To: Lucene Developers List
> > > Subject: Re: idf and explain(), was Re: Search and Scoring
> > >
> > > Chuck Williams wrote:
> > > > That's a good point on how the standard vector space inner product similarity measure does imply that the idf is squared relative to the document tf. Even having been aware of this formula for a long time, this particular implication never occurred to me. Do you know if anybody has done precision/recall or other relevancy empirical measurements comparing this vs.
> > > > a model that does not square idf?
> > >
> > > No, not that I know of.
> > >
> > > > Regarding normalization, the normalization in Hits does not have very nice properties. Due to the > 1.0 threshold check, it loses information, and it arbitrarily defines the highest scoring result in any list that generates scores above 1.0 as a perfect match. It would be nice if score values were meaningful independent of searches, e.g., if "0.6" meant the same quality of retrieval independent of what search was done. This would allow, for example, sites to use a simple quality threshold to only show results that were "good enough". At my last company (I was President and head of engineering at InQuira), we found this to be important to many customers.
> > >
> > > If this is a big issue for you, as it seems it is, please submit a patch to optionally disable score normalization in Hits.java.
> > >
> > > > The standard vector space similarity measure includes normalization by the product of the norms of the vectors, i.e.:
> > > >
> > > >   score(d,q) = sum over t of ( weight(t,q) * weight(t,d) )
> > > >                / sqrt[ (sum(t) weight(t,q)^2) * (sum(t) weight(t,d)^2) ]
> > > >
> > > > This makes the score a cosine, which, since the values are all positive, forces it to be in [0, 1]. The sumOfSquares() normalization in Lucene does not fully implement this. Is there a specific reason for that?
> > >
> > > The quantity 'sum(t) weight(t,d)^2' must be recomputed for each document each time a document is added to the collection, since 'weight(t,d)' is dependent on global term statistics. This is prohibitively expensive. Research has also demonstrated that such cosine normalization gives somewhat inferior results (e.g., Singhal's pivoted length normalization).
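The full cosine measure quoted above is straightforward to state in code. A minimal sketch (plain Python, not Lucene code), assuming sparse weight vectors represented as dicts:

```python
# Hypothetical sketch of the standard vector-space cosine similarity:
# the dot product of the query and document weight vectors divided by
# the product of their Euclidean norms. With non-negative weights the
# result always lies in [0, 1].

import math

def cosine_score(q_weights, d_weights):
    terms = set(q_weights) | set(d_weights)
    dot = sum(q_weights.get(t, 0.0) * d_weights.get(t, 0.0) for t in terms)
    q_norm = math.sqrt(sum(w * w for w in q_weights.values()))
    d_norm = math.sqrt(sum(w * w for w in d_weights.values()))
    return dot / (q_norm * d_norm)

cosine_score({"a": 1.0, "b": 1.0}, {"a": 1.0, "b": 1.0})  # identical vectors
```

As Doug notes, the document-side norm depends on global term statistics, which is why maintaining it incrementally is prohibitively expensive.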
> > > > Re. explain(), I don't see a downside to extending it to show the final normalization in Hits. It could still show the raw score just prior to that normalization.
> > >
> > > In order to normalize scores to 1.0, one must know the maximum score. Explain only computes the score for a single document, and the maximum score is not known.
> > >
> > > > Although I think it would be best to have a normalization that would render scores comparable across searches.
> > >
> > > Then please submit a patch. Lucene doesn't change on its own.
> > >
> > > Doug
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > For additional commands, e-mail: [EMAIL PROTECTED]