Hi Otis, I was re-reading this whole theoretical thread about idf, scoring, normalization, etc. from last October and couldn't help laughing out loud when I read your post, because it summed up what I was thinking the whole time. I think it's really great to have people like Chuck and Paul (Elschot) around. I'm learning so much.
k

On Thu, 21 Oct 2004 10:05:51 -0700 (PDT), Otis Gospodnetic wrote:
> Hi Chuck,
>
> The relative lack of responses is not because there is no interest,
> but probably because there are only a few people on lucene-dev who
> can follow/understand every detail of your proposal. I understand
> and hear you, but I have a hard time 'visualizing' some of the
> formulas in your proposal. What you are saying sounds right to me,
> but I don't have enough theoretical knowledge to go one way or the
> other.
>
> Otis
>
> --- Chuck Williams <[EMAIL PROTECTED]> wrote:
>
>> Hi everybody,
>>
>> Although there doesn't seem to be much interest in this, I have one
>> significant improvement to the proposal below and a couple of
>> observations that clarify the situation.
>>
>> To illustrate the problem that better normalization is intended to
>> address: in my current application, for BooleanQuery's of two terms
>> I frequently get a top score of 1.0 when only one of the terms is
>> matched. 1.0 should indicate a "perfect match". I'd like to set my
>> UI up to present the results differently depending on how good the
>> different results are (e.g., showing a visual indication of result
>> quality, dropping the really bad results entirely, and segregating
>> the good results from other, only vaguely relevant results). To
>> build this kind of "intelligence" into the UI, I certainly need to
>> know whether my top result matched all query terms, or only half of
>> them. I'd like to have the score tell me directly how good the
>> matches are. The current normalization does not achieve this.
>>
>> The intent of a new normalization scheme is to preserve the current
>> scoring in the following sense: the ratio of the nth result's score
>> to the best result's score remains the same. Therefore, the only
>> question is what normalization factor to apply to all scores. This
>> reduces to a very specific question that determines the entire
>> normalization -- what should the score of the best matching result
>> be?
>>
>> The mechanism below has this property, i.e. it keeps the current
>> score ratios, except that I removed one idf term for reasons
>> covered earlier (this better reflects the current empirically best
>> scoring algorithms). However, removing an idf is a completely
>> separate issue. The improved normalization is independent of
>> whether or not that change is made.
>>
>> For the central question of what the top score should be, upon
>> reflection, I don't like the definition below. It defined the top
>> score as (approximately) the percentage of query terms matched by
>> the top scoring result. A better conceptual definition is to use a
>> weighted average based on the boosts, i.e., propagate all boosts
>> down to the underlying terms (or phrases). Specifically, the "net
>> boost" of a term is the product of the direct boost of the term and
>> all boosts applied to encompassing clauses. Then the score of the
>> top result becomes the sum of the net boosts of its matching terms
>> divided by the sum of the net boosts of all query terms.
>>
>> This definition is a refinement of the original proposal below, and
>> it could probably be further refined if some impact of the tf, idf
>> and/or lengthNorm was desired in determining the top score. These
>> other factors seem to be harder to normalize for, although I've
>> thought of some simple approaches; e.g., assume the unmatched terms
>> in the top result have values for these three factors that are the
>> average of the values of the matched terms, then apply exactly the
>> same concept of actual score divided by theoretical maximum score.
>> That would eliminate any need to maintain maximum-value statistics
>> in the index.
>>
>> As an example of the simple boost-based normalization, for the
>> query ((a^2 b)^3 (c d^2)) the net boosts are:
>>
>>   a --> 6
>>   b --> 3
>>   c --> 1
>>   d --> 2
>>
>> So if a and b matched, but not c and d, in the top scoring result,
>> its score would be (6+3)/(6+3+1+2) = 0.75. The normalizer would be
>> 0.75/(current score except for the current normalization). This
>> normalizer would be applied to all current scores (minus
>> normalization) to create the normalized scores.
>>
>> For the simple query (a b), if only one of the terms matched in the
>> top result, then its score would be 0.5, vs. 1.0 or many other
>> possible scores today.
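To make the net-boost propagation concrete, here is a rough, self-contained Java sketch of the walk Chuck describes. The Clause/Term/Bool classes are hypothetical stand-ins for a query tree, not the actual Lucene API; the sketch just reproduces the arithmetic of the example above.

    import java.util.LinkedHashMap;
    import java.util.Map;

    /** Hypothetical minimal query tree: a clause is either a term or a
     *  list of sub-clauses, and every clause carries a boost. */
    abstract class Clause {
        final float boost;
        Clause(float boost) { this.boost = boost; }
    }

    class Term extends Clause {
        final String text;
        Term(String text, float boost) { super(boost); this.text = text; }
    }

    class Bool extends Clause {
        final Clause[] subs;
        Bool(float boost, Clause... subs) { super(boost); this.subs = subs; }
    }

    public class NetBoosts {
        /** The "net boost" of a term is its own boost times the boosts
         *  of all enclosing clauses, collected by walking the tree. */
        static void collect(Clause c, float enclosing, Map<String, Float> out) {
            float net = enclosing * c.boost;
            if (c instanceof Term) {
                out.put(((Term) c).text, net);
            } else {
                for (Clause sub : ((Bool) c).subs) collect(sub, net, out);
            }
        }

        public static void main(String[] args) {
            // ((a^2 b)^3 (c d^2)) from the example above
            Clause query = new Bool(1f,
                new Bool(3f, new Term("a", 2f), new Term("b", 1f)),
                new Bool(1f, new Term("c", 1f), new Term("d", 2f)));
            Map<String, Float> net = new LinkedHashMap<>();
            collect(query, 1f, net);
            System.out.println(net);  // {a=6.0, b=3.0, c=1.0, d=2.0}

            // Top score if only a and b match: sum of the matched net
            // boosts over the sum of all net boosts = (6+3)/(6+3+1+2)
            float matched = net.get("a") + net.get("b");
            float total = 0f;
            for (float v : net.values()) total += v;
            System.out.println(matched / total);  // 0.75
        }
    }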
>> In addition to enabling more "intelligent" UIs that communicate the
>> quality of results to end-users, the proposal below also extends
>> the explain() mechanism to fully explain the final normalized
>> score. However, that change is also independent -- it could be done
>> with the current scoring.
>>
>> Am I the only one who would like to see better normalization in
>> Lucene? Does anybody have a better approach?
>>
>> If you've read this far, thanks for indulging me on this. I would
>> love to see this or something with similar properties in Lucene. I
>> really just want to build my app, but as stated below I would write
>> and contribute this if there is interest in putting it in and
>> nobody else wants to write it. Please let me know what you think,
>> one way or the other.
>>
>> Thanks,
>>
>> Chuck
>>
>>> -----Original Message-----
>>> From: Chuck Williams
>>> Sent: Monday, October 18, 2004 7:04 PM
>>> To: 'Lucene Developers List'
>>> Subject: RE: idf and explain(), was Re: Search and Scoring
>>>
>>> Doug Cutting wrote:
>>>> If this is a big issue for you, as it seems it is, please submit
>>>> a patch to optionally disable score normalization in Hits.java.
>>>
>>> and:
>>>> The quantity 'sum(t) weight(t,d)^2' must be recomputed for each
>>>> document each time a document is added to the collection, since
>>>> 'weight(t,d)' is dependent on global term statistics. This is
>>>> prohibitively expensive. Research has also demonstrated that such
>>>> cosine normalization gives somewhat inferior results (e.g.,
>>>> Singhal's pivoted length normalization).
>>>
>>> I'm willing to write, test and contribute code to address the
>>> normalization issue, i.e. to yield scores in [0, 1] that are
>>> meaningful across searches. Unfortunately, this is considerably
>>> more involved than just optionally eliminating the current
>>> normalization in Hits. Before undertaking this, I'd like to see if
>>> there is agreement in principle that this is a good idea, and that
>>> my specific proposal below is the right way to go. Also, I'd like
>>> to make sure I've correctly inferred the constraints on writing
>>> code to be incorporated into Lucene.
>>>
>>> After looking at this in more detail, I agree that the cosine
>>> normalization is not the way to go, because of both efficiency and
>>> effectiveness considerations. A conceptual approach that would be
>>> efficient, relatively easy to implement, and seems to have at
>>> least reasonable behavior would be to define the top scoring match
>>> to have a score roughly equal to the percentage of query terms it
>>> matches (its "netCoord"). Scores below the top hit would be
>>> reduced based on the ratio of their raw scores to the raw score of
>>> the top hit (considering all of the current score factors, except
>>> that I'd like to remove one of the idf factors, as discussed
>>> earlier).
>>>
>>> For a couple of simple cases:
>>> a) the top match for a single-term query would always have a score
>>>    of 1.0;
>>> b) the top scoring match for a BooleanQuery using
>>>    DefaultSimilarity with all non-prohibited TermQuery clauses
>>>    would have a score of m/n, where the hit matches m of the n
>>>    terms.
>>>
>>> This isn't optimal, but seems much better than the current
>>> situation. Consider two single-term queries, s and t. If s matches
>>> more strongly than t in its top hit (e.g., a higher tf in a
>>> shorter field), it would be best if the top score of s were
>>> greater than the top score of t. But this kind of normalization
>>> would require keeping additional statistics that, so far as I
>>> know, are not currently in the index, like the maximum tf for
>>> terms and the minimum length for fields. These could be expensive
>>> to update on deletes. Also, normalizing by such factors could
>>> yield lower than subjectively reasonable scores in most cases, so
>>> it's not clear it would be better.
>>>
>>> The semantics above are at least clean, easy to understand, and
>>> support what seems to me the most important motivation to do this:
>>> allowing an application to use simple thresholding to segregate
>>> likely-to-be-relevant hits from likely-to-be-irrelevant hits.
>>>
>>> More specifically, for a BooleanQuery of TermQuery's, the
>>> resulting score functions would be:
>>>
>>> BooleanQuery of TermQuery's sbq = (tq1 ... tqn)
>>>
>>> baseScore(sbq, doc) =
>>>     sum(tqi) boost(tqi) * idf(tqi.term) * tf(tqi.term, doc) *
>>>              lengthNorm(tqi.term.field, doc)
>>>
>>> rawScore(sbq, doc) = coord(sbq, doc) * baseScore(sbq, doc)
>>>
>>> norm(sbq, hits) = 1 / max(hit in hits) baseScore(sbq, hit)
>>>
>>> score(sbq, doc) = rawScore(sbq, doc) * norm(sbq, hits)
>>>
>>> rawScores can be computed in the Scorer.score() methods and
>>> therefore used to sort the hits (e.g., in the instance method for
>>> collect() in the HitCollector in IndexSearcher.search()). If the
>>> top scoring hit does not have the highest baseScore, then its
>>> score could be less than its coord; this seems desirable. These
>>> formulas imply that no result score can be larger than its coord,
>>> so if coord is well-defined (always between 0 and 1) then all
>>> results will be normalized between 0 and 1.
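A compact sketch of this scoring pipeline, assuming per-term weights (boost * idf) and per-document tf * lengthNorm values are supplied as plain arrays rather than read from an index; illustrative only, not Lucene's actual Scorer/HitCollector code.

    /** Illustrative only: scores documents for a pure-disjunction term
     *  query using baseScore, rawScore = coord * baseScore, and a final
     *  normalization by the maximum baseScore, per the formulas above. */
    public class NormalizedScoring {
        public static void main(String[] args) {
            // Two query terms; per-term weight = boost * idf
            // (document-independent).
            float[] termWeight = { 1.0f * 2.0f, 1.0f * 0.5f };

            // tf * lengthNorm per document and term (doc 0 matches both
            // terms, doc 1 matches only the first).
            float[][] tfNorm = { { 1.4f, 0.9f }, { 2.0f, 0.0f } };

            int nDocs = tfNorm.length, nTerms = termWeight.length;
            float[] rawScore = new float[nDocs];
            float maxBase = 0f;
            for (int d = 0; d < nDocs; d++) {
                float base = 0f;
                int matched = 0;
                for (int t = 0; t < nTerms; t++) {
                    if (tfNorm[d][t] > 0f) matched++;
                    base += termWeight[t] * tfNorm[d][t];
                }
                float coord = (float) matched / nTerms; // fraction of terms matched
                rawScore[d] = coord * base;
                maxBase = Math.max(maxBase, base);      // norm = 1 / max baseScore
            }
            for (int d = 0; d < nDocs; d++) {
                // Final score; no result can exceed its coord, so all
                // scores land in [0, 1].
                System.out.println("doc " + d + ": " + rawScore[d] / maxBase);
            }
        }
    }

With these toy numbers doc 0 scores 0.8125 and doc 1 scores 0.5: the top hit scores below its coord of 1.0 because it does not have the highest baseScore, which is exactly the behavior the message calls desirable.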
>>> In general, the netCoord, which takes the place of coord in the
>>> simple case above, needs to be defined for any query. Conceptually,
>>> this should be the total percentage of query terms matched by the
>>> document. It must be recursively computable from the subquery
>>> structure and overridable for specific Query types (e.g., to
>>> support specialized coords, like one that is always 1.0, as is
>>> useful in multi-field searching). Suitable default definitions for
>>> TermQuery and BooleanQuery are:
>>>
>>> TermQuery.netCoord = 1.0 if the term matches, 0.0 otherwise
>>>
>>> BooleanQuery(c1 ... cn).netCoord = sum(ci) coord(1, n) * ci.netCoord
>>>
>>> This is not quite the percentage of terms matched; e.g., consider a
>>> BooleanQuery with two clauses, one of which is a BooleanQuery of 3
>>> terms and the other of which is just a term. However, it doesn't
>>> seem unreasonable for a BooleanQuery to state that its clauses are
>>> equally important, and this is consistent with the current coord
>>> behavior. BooleanQuery.netCoord could be overridden for special
>>> cases, like the pure disjunction I use in my app for field
>>> expansions:
>>>
>>> MaxDisjunctionQuery(c1 .. cn).netCoord = max(ci) ci.netCoord
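A rough recursive sketch of these default netCoord definitions, using made-up node classes and DefaultSimilarity's coord(m, n) = m/n; not actual Lucene code.

    import java.util.Set;

    /** Hypothetical query nodes, just enough to show the recursion. */
    interface Node { float netCoord(Set<String> matchedTerms); }

    class TermNode implements Node {
        final String term;
        TermNode(String term) { this.term = term; }
        // TermQuery.netCoord = 1.0 if the term matches, 0.0 otherwise
        public float netCoord(Set<String> matched) {
            return matched.contains(term) ? 1f : 0f;
        }
    }

    class BoolNode implements Node {
        final Node[] clauses;
        BoolNode(Node... clauses) { this.clauses = clauses; }
        // BooleanQuery.netCoord = sum(ci) coord(1, n) * ci.netCoord,
        // with coord(1, n) = 1/n as in DefaultSimilarity
        public float netCoord(Set<String> matched) {
            float sum = 0f;
            for (Node c : clauses) sum += c.netCoord(matched);
            return sum / clauses.length;
        }
    }

    class MaxDisjunctionNode implements Node {
        final Node[] clauses;
        MaxDisjunctionNode(Node... clauses) { this.clauses = clauses; }
        // Pure disjunction: netCoord = max(ci) ci.netCoord
        public float netCoord(Set<String> matched) {
            float max = 0f;
            for (Node c : clauses) max = Math.max(max, c.netCoord(matched));
            return max;
        }
    }

    public class NetCoordDemo {
        public static void main(String[] args) {
            // ((a b c) d): only d matches --> netCoord = (0 + 1)/2 = 0.5,
            // showing this is not quite "percentage of terms matched" (1/4).
            Node q = new BoolNode(
                new BoolNode(new TermNode("a"), new TermNode("b"),
                             new TermNode("c")),
                new TermNode("d"));
            System.out.println(q.netCoord(Set.of("d")));  // 0.5
        }
    }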
>>> Implementing this would proceed along these lines:
>>> 1. For backwards compatibility, add some kind of newScoring
>>>    boolean setting.
>>> 2. Update all of these places to behave as indicated if newScoring:
>>>    a. Change Query.weight() to not do any normalization (no call
>>>       to sumOfSquaredWeights(), Similarity.queryNorm() or
>>>       normalize()).
>>>    b. Update all Query.weight classes to set their value according
>>>       to the terms in the score formula above that don't involve
>>>       the document (e.g., boost*idf for TermQuery).
>>>    c. Add suitable netCoord definitions to all Scorer classes.
>>>    d. Update all Scorer.score() methods to compute the rawScore as
>>>       above.
>>>    e. Add the maximum baseScore as a field kept on TopDocs,
>>>       computed in the HitCollector's.
>>>    f. Change the normalization in Hits to always divide every raw
>>>       score by the maximum baseScore.
>>>    g. Update all of the current explain() methods to be consistent
>>>       with this scoring, and to either report the rawScore, or to
>>>       report the final score if the normalization factor is
>>>       provided.
>>>    h. Add Hits.explain() (or better, extend Searcher so that it
>>>       keeps the Hits for use in Searcher.explain()) to call the
>>>       new explain variation with the normalization factor so that
>>>       final scores are fully explained.
>>>
>>> If this seems like a good idea, please let me know. I'm sure there
>>> are details I've missed that would come out during coding and
>>> testing. Also, the value of this depends on how reasonable the
>>> final scores look, which is hard to tell for sure until it is
>>> working.
>>>
>>> The coding standards for Lucene seem reasonably clear from the
>>> source code I've read. I could use just simple Java, so there
>>> shouldn't be any significant JVM dependencies. The above should be
>>> fully backward compatible due to the newScoring flag.
>>>
>>> On another note, I had to remove the German analyzer in my current
>>> 1.4.2 source configuration because GermanStemmer failed to compile
>>> due to what are apparently Unicode character constants that I've
>>> now got as illegal two-character character constants. This is
>>> presumably an encoding problem somewhere that I could track down.
>>> It's not important, but if the answer is obvious to any of you,
>>> I'd appreciate the quick tip.
>>>
>>> Thanks,
>>>
>>> Chuck
>>>
>>>> -----Original Message-----
>>>> From: Doug Cutting [mailto:[EMAIL PROTECTED]]
>>>> Sent: Monday, October 18, 2004 9:44 AM
>>>> To: Lucene Developers List
>>>> Subject: Re: idf and explain(), was Re: Search and Scoring
>>>>
>>>> Chuck Williams wrote:
>>>>> That's a good point on how the standard vector space inner
>>>>> product similarity measure does imply that the idf is squared
>>>>> relative to the document tf. Even having been aware of this
>>>>> formula for a long time, this particular implication never
>>>>> occurred to me. Do you know if anybody has done precision/recall
>>>>> or other empirical relevancy measurements comparing this vs. a
>>>>> model that does not square idf?
>>>>
>>>> No, not that I know of.
>>>>
>>>>> Regarding normalization, the normalization in Hits does not have
>>>>> very nice properties. Due to the > 1.0 threshold check, it loses
>>>>> information, and it arbitrarily defines the highest scoring
>>>>> result in any list that generates scores above 1.0 as a perfect
>>>>> match. It would be nice if score values were meaningful
>>>>> independent of searches, e.g., if "0.6" meant the same quality
>>>>> of retrieval independent of what search was done. This would
>>>>> allow, for example, sites to use a simple quality threshold to
>>>>> only show results that were "good enough". At my last company (I
>>>>> was President and head of engineering for InQuira), we found
>>>>> this to be important to many customers.
>>>>
>>>> If this is a big issue for you, as it seems it is, please submit
>>>> a patch to optionally disable score normalization in Hits.java.
>>>>
>>>>> The standard vector space similarity measure includes
>>>>> normalization by the product of the norms of the vectors, i.e.:
>>>>>
>>>>> score(d,q) = sum(t) ( weight(t,q) * weight(t,d) ) /
>>>>>     sqrt[ (sum(t) weight(t,q)^2) * (sum(t) weight(t,d)^2) ]
>>>>>
>>>>> This makes the score a cosine, which, since the values are all
>>>>> positive, forces it to be in [0, 1]. The sumOfSquares()
>>>>> normalization in Lucene does not fully implement this. Is there
>>>>> a specific reason for that?
>>>>
>>>> The quantity 'sum(t) weight(t,d)^2' must be recomputed for each
>>>> document each time a document is added to the collection, since
>>>> 'weight(t,d)' is dependent on global term statistics. This is
>>>> prohibitively expensive. Research has also demonstrated that such
>>>> cosine normalization gives somewhat inferior results (e.g.,
>>>> Singhal's pivoted length normalization).
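For reference, a small self-contained Java sketch of the cosine measure quoted above, computed over toy weight vectors; it is unrelated to Lucene's actual implementation.

    /** Cosine similarity per the quoted formula: the dot product of the
     *  query and document weight vectors divided by the product of
     *  their Euclidean norms. With non-negative weights the result is
     *  always in [0, 1]. */
    public class CosineDemo {
        static double cosine(double[] q, double[] d) {
            double dot = 0, qq = 0, dd = 0;
            for (int t = 0; t < q.length; t++) {
                dot += q[t] * d[t];   // sum(t) weight(t,q) * weight(t,d)
                qq  += q[t] * q[t];   // sum(t) weight(t,q)^2
                dd  += d[t] * d[t];   // sum(t) weight(t,d)^2
            }
            return dot / Math.sqrt(qq * dd);
        }

        public static void main(String[] args) {
            double[] q = { 1.0, 0.5, 0.0 };   // weight(t,q) per term
            double[] d = { 0.8, 0.0, 0.3 };   // weight(t,d) per term
            System.out.println(cosine(q, d)); // ~0.84
        }
    }

As Doug notes, the document-norm factor is what makes this prohibitively expensive in a live index: it depends on global term statistics and would need recomputing for every document on each addition.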
>>>>> Re. explain(), I don't see a downside to extending it to show
>>>>> the final normalization in Hits. It could still show the raw
>>>>> score just prior to that normalization.
>>>>
>>>> In order to normalize scores to 1.0 one must know the maximum
>>>> score. Explain only computes the score for a single document, and
>>>> the maximum score is not known.
>>>>
>>>>> Although I think it would be best to have a normalization that
>>>>> would render scores comparable across searches.
>>>>
>>>> Then please submit a patch. Lucene doesn't change on its own.
>>>>
>>>> Doug