Exactly. Luckily, since then I've learned a bit from lucene-dev discussions and some side reading on IR, so some of the topics are making more sense now.
Otis

--- Kelvin Tan <[EMAIL PROTECTED]> wrote:

> Hi Otis, I was re-reading this whole theoretical thread about idf, scoring, normalization, etc. from last Oct and couldn't help laughing out loud when I read your post, coz it summed up what I was thinking the whole time. I think it's really great to have people like Chuck and Paul (Elschot) around. I'm learning so much.
>
> k
>
> On Thu, 21 Oct 2004 10:05:51 -0700 (PDT), Otis Gospodnetic wrote:
> > Hi Chuck,
> >
> > The relative lack of responses is not because there is no interest, but probably because there are only a few people on lucene-dev who can follow/understand every detail of your proposal. I understand and hear you, but I have a hard time 'visualizing' some of the formulas in your proposal. What you are saying sounds right to me, but I don't have enough theoretical knowledge to go one way or the other.
> >
> > Otis
> >
> > --- Chuck Williams <[EMAIL PROTECTED]> wrote:
> >
> >> Hi everybody,
> >>
> >> Although there doesn't seem to be much interest in this, I have one significant improvement to the below and a couple of observations that clarify the situation.
> >>
> >> To illustrate the problem better normalization is intended to address: in my current application, for BooleanQuery's of two terms, I frequently get a top score of 1.0 when only one of the terms is matched. 1.0 should indicate a "perfect match". I'd like to set my UI up to present the results differently depending on how good the different results are (e.g., showing a visual indication of result quality, dropping the really bad results entirely, and segregating the good results from other only vaguely relevant results). To build this kind of "intelligence" into the UI, I certainly need to know whether my top result matched all query terms, or only half of them. I'd like to have the score tell me directly how good the matches are. The current normalization does not achieve this.
> >>
> >> The intent of a new normalization scheme is to preserve the current scoring in the following sense: the ratio of the nth result's score to the best result's score remains the same. Therefore, the only question is what normalization factor to apply to all scores. This reduces to a very specific question that determines the entire normalization -- what should the score of the best matching result be?
> >>
> >> The mechanism below has this property, i.e. it keeps the current score ratios, except that I removed one idf term for reasons covered earlier (this better reflects the current empirically best scoring algorithms). However, removing an idf is a completely separate issue; the improved normalization is independent of whether or not that change is made.
> >>
> >> For the central question of what the top score should be, upon reflection, I don't like the definition below. It defined the top score as (approximately) the percentage of query terms matched by the top scoring result. A better conceptual definition is to use a weighted average based on the boosts, i.e., downward-propagate all boosts to the underlying terms (or phrases). Specifically, the "net boost" of a term is the product of the direct boost of the term and all boosts applied to encompassing clauses.
> >> Then the score of the top result becomes the sum of the net boosts of its matching terms divided by the sum of the net boosts of all query terms.
> >>
> >> This definition is a refinement of the original proposal below, and it could probably be further refined if some impact of the tf, idf and/or lengthNorm were desired in determining the top score. These other factors seem to be harder to normalize for, although I've thought of some simple approaches; e.g., assume the unmatched terms in the top result have values for these three factors that are the average of the values of the matched terms, then apply exactly the same concept of actual score divided by theoretical maximum score. That would eliminate any need to maintain maximum-value statistics in the index.
> >>
> >> As an example of the simple boost-based normalization, for the query ((a^2 b)^3 (c d^2)) the net boosts are:
> >>
> >>   a --> 6
> >>   b --> 3
> >>   c --> 1
> >>   d --> 2
> >>
> >> So if a and b matched, but not c and d, in the top scoring result, its score would be 0.75. The normalizer would be 0.75 / (current score except for the current normalization). This normalizer would be applied to all current scores (minus normalization) to create the normalized scores.
> >>
> >> For the simple query (a b), if only one of the terms matched in the top result, then its score would be 0.5, vs. 1.0 or many other possible scores today.
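> >> To make the net-boost arithmetic concrete, here is a rough standalone sketch (the class names are hypothetical, not Lucene API) that propagates boosts down a query tree and computes the proposed top score:
> >>
> >>     import java.util.*;
> >>
> >>     /** A query node: a term leaf or a clause over child nodes, each with a boost. */
> >>     abstract class Node {
> >>         final float boost;
> >>         Node(float boost) { this.boost = boost; }
> >>     }
> >>
> >>     class Term extends Node {
> >>         final String text;
> >>         Term(String text, float boost) { super(boost); this.text = text; }
> >>     }
> >>
> >>     class Clause extends Node {
> >>         final List<Node> children;
> >>         Clause(float boost, Node... children) { super(boost); this.children = Arrays.asList(children); }
> >>     }
> >>
> >>     public class NetBoost {
> >>         // A term's net boost is its own boost times the boosts of all enclosing clauses.
> >>         static void collect(Node n, float enclosing, Map<String, Float> out) {
> >>             float net = enclosing * n.boost;
> >>             if (n instanceof Term) {
> >>                 out.put(((Term) n).text, net);
> >>             } else {
> >>                 for (Node child : ((Clause) n).children) collect(child, net, out);
> >>             }
> >>         }
> >>
> >>         // Top score = sum of net boosts of matched terms / sum of net boosts of all terms.
> >>         static float topScore(Node query, Set<String> matched) {
> >>             Map<String, Float> net = new HashMap<String, Float>();
> >>             collect(query, 1.0f, net);
> >>             float all = 0f, hit = 0f;
> >>             for (Map.Entry<String, Float> e : net.entrySet()) {
> >>                 all += e.getValue();
> >>                 if (matched.contains(e.getKey())) hit += e.getValue();
> >>             }
> >>             return hit / all;
> >>         }
> >>
> >>         public static void main(String[] args) {
> >>             // ((a^2 b)^3 (c d^2)): net boosts are a=6, b=3, c=1, d=2
> >>             Node q = new Clause(1f,
> >>                     new Clause(3f, new Term("a", 2f), new Term("b", 1f)),
> >>                     new Clause(1f, new Term("c", 1f), new Term("d", 2f)));
> >>             Set<String> matched = new HashSet<String>(Arrays.asList("a", "b"));
> >>             System.out.println(topScore(q, matched));   // 0.75 = (6+3)/(6+3+1+2)
> >>         }
> >>     }
> >>
> >> The set of matched terms here just stands in for whatever per-document match information the Scorers would supply; the same downward propagation handles nesting of any depth.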
> >> In addition to enabling more "intelligent" UIs that communicate the quality of results to end-users, the proposal below also extends the explain() mechanism to fully explain the final normalized score. However, that change is also independent -- it could be done with the current scoring.
> >>
> >> Am I the only one who would like to see better normalization in Lucene? Does anybody have a better approach?
> >>
> >> If you've read this far, thanks for indulging me on this. I would love to see this or something with similar properties in Lucene. I really just want to build my app, but as stated below I would write and contribute this if there is interest in putting it in and nobody else wants to write it. Please let me know what you think one way or the other.
> >>
> >> Thanks,
> >>
> >> Chuck
> >>
> >>> -----Original Message-----
> >>> From: Chuck Williams
> >>> Sent: Monday, October 18, 2004 7:04 PM
> >>> To: 'Lucene Developers List'
> >>> Subject: RE: idf and explain(), was Re: Search and Scoring
> >>>
> >>> Doug Cutting wrote:
> >>>> If this is a big issue for you, as it seems it is, please submit a patch to optionally disable score normalization in Hits.java.
> >>> and:
> >>>> The quantity 'sum(t) weight(t,d)^2' must be recomputed for each document each time a document is added to the collection, since 'weight(t,d)' is dependent on global term statistics. This is prohibitively expensive. Research has also demonstrated that such cosine normalization gives somewhat inferior results (e.g., Singhal's pivoted length normalization).
> >>>
> >>> I'm willing to write, test and contribute code to address the normalization issue, i.e. to yield scores in [0, 1] that are meaningful across searches. Unfortunately, this is considerably more involved than just optionally eliminating the current normalization in Hits. Before undertaking this, I'd like to see if there is agreement in principle that this is a good idea, and that my specific proposal below is the right way to go. Also, I'd like to make sure I've correctly inferred the constraints on writing code to be incorporated into Lucene.
> >>>
> >>> After looking at this in more detail, I agree that the cosine normalization is not the way to go, because of both efficiency and effectiveness considerations. A conceptual approach that would be efficient, relatively easy to implement, and seems to have at least reasonable behavior would be to define the top scoring match to have a score roughly equal to the percentage of query terms it matches (its "netCoord"). Scores below the top hit would be reduced based on the ratio of their raw scores to the raw score of the top hit (considering all of the current score factors, except that I'd like to remove one of the idf factors, as discussed earlier).
> >>>
> >>> For a couple of simple cases:
> >>> a) the top match for a single-term query would always have a score of 1.0;
> >>> b) the top scoring match for a BooleanQuery using DefaultSimilarity with all non-prohibited TermQuery clauses would have a score of m/n, where the hit matches m of the n terms.
> >>>
> >>> This isn't optimal, but seems much better than the current situation. Consider two single-term queries, s and t. If s matches more strongly than t in its top hit (e.g., a higher tf in a shorter field), it would be best if the top score of s were greater than the top score of t. But this kind of normalization would require keeping additional statistics that, so far as I know, are not currently in the index, like the maximum tf for terms and the minimum length for fields. These could be expensive to update on deletes. Also, normalizing by such factors could yield lower than subjectively reasonable scores in most cases, so it's not clear it would be better.
> >>>
> >>> The semantics above are at least clean, easy to understand, and support what seems to me the most important motivation to do this: allowing an application to use simple thresholding to segregate likely-to-be-relevant hits from likely-to-be-irrelevant hits.
> >>>
> >>> More specifically, for a BooleanQuery of TermQuery's the resulting score functions would be:
> >>> BooleanQuery of TermQuery's: sbq = (tq1 ... tqn)
> >>>
> >>> baseScore(sbq, doc) = sum(tqi) boost(tqi) * idf(tqi.term) * tf(tqi.term, doc) * lengthNorm(tqi.term.field, doc)
> >>>
> >>> rawScore(sbq, doc) = coord(sbq, doc) * baseScore(sbq, doc)
> >>>
> >>> norm(sbq, hits) = 1 / max(hit in hits) baseScore(sbq, hit)
> >>>
> >>> score(sbq, doc) = rawScore(sbq, doc) * norm(sbq, hits)
> >>>
> >>> rawScores can be computed in the Scorer.score() methods and therefore used to sort the hits (e.g., in the instance method for collect() in the HitCollector in IndexSearcher.search()). If the top scoring hit does not have the highest baseScore, then its score could be less than its coord; this seems desirable. These formulas imply that no result score can be larger than its coord, so if coord is well-defined (always between 0 and 1) then all results will be normalized between 0 and 1.
> >>>
> >>> In general, the netCoord, which takes the place of coord in the simple case above, needs to be defined for any query. Conceptually, this should be the total percentage of query terms matched by the document. It must be recursively computable from the subquery structure and overridable for specific Query types (e.g., to support specialized coords, like one that is always 1.0, as is useful in multi-field searching). Suitable default definitions for TermQuery and BooleanQuery are:
> >>>
> >>> TermQuery.netCoord = 1.0 if the term matches, 0.0 otherwise
> >>>
> >>> BooleanQuery(c1 ... cn).netCoord = sum(ci) coord(1, n) * ci.netCoord
> >>>
> >>> This is not quite the percentage of terms matched; e.g., consider a BooleanQuery with two clauses, one of which is a BooleanQuery of 3 terms and the other of which is just a term. However, it doesn't seem unreasonable for a BooleanQuery to state that its clauses are equally important, and this is consistent with the current coord behavior. BooleanQuery.netCoord could be overridden for special cases, like the pure disjunction I use in my app for field expansions:
> >>>
> >>> MaxDisjunctionQuery(c1 ... cn).netCoord = max(ci) ci.netCoord
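> >>> A rough sketch of the recursion (hypothetical scorer types, not the real Scorer classes; each netCoord() is asked about the current document):
> >>>
> >>>     interface NetCoord {
> >>>         float netCoord();   // fraction in [0, 1] of this query's terms matched
> >>>     }
> >>>
> >>>     class TermNetCoord implements NetCoord {
> >>>         final boolean matches;
> >>>         TermNetCoord(boolean matches) { this.matches = matches; }
> >>>         public float netCoord() { return matches ? 1.0f : 0.0f; }
> >>>     }
> >>>
> >>>     class BooleanNetCoord implements NetCoord {
> >>>         final NetCoord[] clauses;
> >>>         BooleanNetCoord(NetCoord... clauses) { this.clauses = clauses; }
> >>>         public float netCoord() {
> >>>             float sum = 0f;
> >>>             for (NetCoord c : clauses) sum += c.netCoord();
> >>>             return sum / clauses.length;   // coord(1, n) = 1/n: clauses weighted equally
> >>>         }
> >>>     }
> >>>
> >>>     class MaxDisjunctionNetCoord implements NetCoord {
> >>>         final NetCoord[] clauses;
> >>>         MaxDisjunctionNetCoord(NetCoord... clauses) { this.clauses = clauses; }
> >>>         public float netCoord() {
> >>>             float best = 0f;
> >>>             for (NetCoord c : clauses) best = Math.max(best, c.netCoord());
> >>>             return best;   // pure disjunction: the best clause alone determines coverage
> >>>         }
> >>>     }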
> >>> Implementing this would proceed along these lines:
> >>> 1. For backwards compatibility, add some kind of newScoring boolean setting.
> >>> 2. Update all of these places to behave as indicated if newScoring:
> >>>    a. Change Query.weight() to not do any normalization (no call to sumOfSquaredWeights(), Similarity.queryNorm() or normalize()).
> >>>    b. Update all Query.weight classes to set their value according to the terms in the score formula above that don't involve the document (e.g., boost*idf for TermQuery).
> >>>    c. Add suitable netCoord definitions to all Scorer classes.
> >>>    d. Update all Scorer.score() methods to compute the rawScore as above.
> >>>    e. Add the maximum baseScore as a field kept on TopDocs, computed in the HitCollectors (see the sketch after this list).
> >>>    f. Change the normalization in Hits to always divide every raw score by the maximum baseScore.
> >>>    g. Update all of the current explain() methods to be consistent with this scoring, and to either report the rawScore, or to report the final score if the normalization factor is provided.
> >>>    h. Add Hits.explain() (or, better, extend Searcher so that it keeps the Hits for use in Searcher.explain()) to call the new explain variation with the normalization factor so that final scores are fully explained.
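> >>> A rough sketch of steps e and f (hypothetical types; the real change would live in the HitCollector and in Hits):
> >>>
> >>>     class ScoredHit {
> >>>         final int doc;
> >>>         final float rawScore;    // coord * baseScore
> >>>         final float baseScore;   // per-term weight sum, without coord
> >>>         ScoredHit(int doc, float rawScore, float baseScore) {
> >>>             this.doc = doc; this.rawScore = rawScore; this.baseScore = baseScore;
> >>>         }
> >>>     }
> >>>
> >>>     class NormalizingCollector {
> >>>         final java.util.List<ScoredHit> hits = new java.util.ArrayList<ScoredHit>();
> >>>         float maxBaseScore = 0f;
> >>>
> >>>         void collect(ScoredHit hit) {   // called once per matching document
> >>>             hits.add(hit);
> >>>             if (hit.baseScore > maxBaseScore) maxBaseScore = hit.baseScore;
> >>>         }
> >>>
> >>>         // Final score = rawScore / max baseScore. Since rawScore = coord * baseScore
> >>>         // and baseScore <= maxBaseScore, the result never exceeds the hit's coord,
> >>>         // so it stays in [0, 1] whenever coord does.
> >>>         float finalScore(ScoredHit hit) {
> >>>             return maxBaseScore > 0f ? hit.rawScore / maxBaseScore : 0f;
> >>>         }
> >>>     }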
> >>> If this seems like a good idea, please let me know. I'm sure there are details I've missed that would come out during coding and testing. Also, the value of this depends on how reasonable the final scores look, which is hard to tell for sure until it is working.
> >>>
> >>> The coding standards for Lucene seem reasonably clear from the source code I've read. I could use just simple Java, so there shouldn't be any significant JVM dependencies. The above should be fully backward compatible due to the newScoring flag.
> >>>
> >>> On another note, I had to remove the German analyzer in my current 1.4.2 source configuration because GermanStemmer failed to compile due to what are apparently Unicode character constants that I've now got as illegal two-character character constants. This is presumably an encoding problem somewhere that I could track down. It's not important, but if the answer is obvious to any of you, I'd appreciate the quick tip.
> >>>
> >>> Thanks,
> >>>
> >>> Chuck
> >>>
> >>>> -----Original Message-----
> >>>> From: Doug Cutting [mailto:[EMAIL PROTECTED]]
> >>>> Sent: Monday, October 18, 2004 9:44 AM
> >>>> To: Lucene Developers List
> >>>> Subject: Re: idf and explain(), was Re: Search and Scoring
> >>>>
> >>>> Chuck Williams wrote:
> >>>>> That's a good point on how the standard vector space inner product similarity measure does imply that the idf is squared relative to the document tf. Even having been aware of this formula for a long time, this particular implication never occurred to me. Do you know if anybody has done precision/recall or other empirical relevancy measurements comparing this vs. a model that does not square idf?
> >>>>
> >>>> No, not that I know of.
> >>>>
> >>>>> Regarding normalization, the normalization in Hits does not have very nice properties. Due to the > 1.0 threshold check, it loses information, and it arbitrarily defines the highest scoring result in any list that generates scores above 1.0 as a perfect match. It would be nice if score values were meaningful independent of searches, e.g., if "0.6" meant the same quality of retrieval independent of what search was done. This would allow, for example, sites to use a simple quality threshold to only show results that were "good enough". At my last company (I was President and head of engineering for InQuira), we found this to be important to many customers.
> >>>>
> >>>> If this is a big issue for you, as it seems it is, please submit a patch to optionally disable score normalization in Hits.java.
> >>>>
> >>>>> The standard vector space similarity measure includes normalization by the product of the norms of the vectors, i.e.:
> >>>>>
> >>>>> score(d,q) = sum(t) [ weight(t,q) * weight(t,d) ] / sqrt[ (sum(t) weight(t,q)^2) * (sum(t) weight(t,d)^2) ]
> >>>>>
> >>>>> This makes the score a cosine, which, since the values are all positive, forces it to be in [0, 1]. The sumOfSquares() normalization in Lucene does not fully implement this. Is there a specific reason for that?
> >>>>
> >>>> The quantity 'sum(t) weight(t,d)^2' must be recomputed for each document each time a document is added to the collection, since 'weight(t,d)' is dependent on global term statistics. This is prohibitively expensive. Research has also demonstrated that such cosine normalization gives somewhat inferior results (e.g., Singhal's pivoted length normalization).
> >>>>
> >>>>> Re. explain(), I don't see a downside to extending it to show the final normalization in Hits. It could still show the raw score just prior to that normalization.
> >>>>
> >>>> In order to normalize scores to 1.0 one must know the maximum score. explain() only computes the score for a single document, and the maximum score is not known.
> >>>>
> >>>>> Although I think it would be best to have a normalization that would render scores comparable across searches.
> >>>>
> >>>> Then please submit a patch. Lucene doesn't change on its own.
> >>>>
> >>>> Doug
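For anyone else who has a hard time 'visualizing' the formulas: here is the cosine measure quoted in the exchange above, written as a small self-contained function (illustrative only; Lucene does not score this way, for exactly the efficiency reasons Doug gives):

    import java.util.Map;

    public class CosineSketch {
        // score(d,q) = sum(t) weight(t,q)*weight(t,d)
        //              / sqrt( sum(t) weight(t,q)^2 * sum(t) weight(t,d)^2 )
        static double cosineScore(Map<String, Double> q, Map<String, Double> d) {
            double dot = 0, qNorm = 0, dNorm = 0;
            for (Map.Entry<String, Double> e : q.entrySet()) {
                qNorm += e.getValue() * e.getValue();
                Double w = d.get(e.getKey());
                if (w != null) dot += e.getValue() * w;   // term present in both vectors
            }
            for (double w : d.values()) dNorm += w * w;
            // With non-negative weights the result is a cosine in [0, 1].
            return dot / Math.sqrt(qNorm * dNorm);
        }
    }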