Wouldn't it be great if we could form a study group of Lucene folks who want to take the "next step"? I feel uneasy posting non-Lucene-specific questions to dev or user even if it's related to IR.
Feels to me like there could be a couple of people like us who didn't do a dissertation in IR, but would like a more in-depth knowledge for practical purposes. Basically, the end result is that we are able to tune or extend Lucene by using the expert API (classes marked as Expert). Perhaps a possible outcome is a tuning tutorial for advanced users who already know how to use Lucene. What do you think? k On Sat, 5 Feb 2005 22:10:26 -0800 (PST), Otis Gospodnetic wrote: > Exactly. Luckily, since then I've learned a bit from lucene-dev > discussions and side IR readings, so some of the topics are making > more sense now. > > Otis > > --- Kelvin Tan <[EMAIL PROTECTED]> wrote: > >> Hi Otis, I was re-reading this whole theoretical thread about >> idf, scoring, normalization, etc. from last Oct and couldn't help >> laughing out loud when I read your post, coz it summed up what I >> was thinking the whole time. I think it's really great to have >> people like Chuck and Paul (Elschot) around. I'm learning so much. >> >> k >> >> On Thu, 21 Oct 2004 10:05:51 -0700 (PDT), Otis Gospodnetic wrote: >> >>> Hi Chuck, >>> >>> The relative lack of responses is not because there is no >>> interest, but probably because there are only a few people on >>> lucene-dev who can follow/understand every detail of your >>> proposal. I understand and hear you, but I have a hard time >>> 'visualizing' some of the formulas in your proposal. What you >>> are saying sounds right to me, but I don't have enough >>> theoretical knowledge to go one way or the other. >>> >>> Otis >>> >>> >>> --- Chuck Williams <[EMAIL PROTECTED]> wrote: >>> >>>> Hi everybody, >>>> >>>> Although there doesn't seem to be much interest in this, I >>>> have one significant improvement to the below and a couple >>>> observations that clarify the situation. 
>>>> >>>> To illustrate the problem better normalization is intended to >>>> address, >>>> in my current application for BooleanQuery's of two terms, I >>>> frequently >>>> get a top score of 1.0 when only one of the terms is matched. >>>> 1.0 should indicate a "perfect match". I'd like to set my UI up >>>> to present the >>>> results differently depending on how good the different >>>> results are (e.g., showing a visual indication of result >>>> quality, dropping the really bad results entirely, and >>>> segregating the good results from other >>>> only vaguely relevant results). To build this kind of >>>> "intelligence" into the UI, I certainly need to know whether >>>> my top result matched all >>>> query terms, or only half of them. I'd like to have the >>>> score tell me >>>> directly how good the matches are. The current normalization >>>> does not achieve this. >>>> >>>> The intent of a new normalization scheme is to preserve the >>>> current scoring in the following sense: the ratio of the nth >>>> result's score to >>>> the best result's score remains the same. Therefore, the >>>> only question >>>> is what normalization factor to apply to all scores. This >>>> reduces to a >>>> very specific question that determines the entire >>>> normalization -- what should the score of the best matching >>>> result be? >>>> >>>> The mechanism below has this property, i.e. it keeps the >>>> current score >>>> ratios, except that I removed one idf term for reasons >>>> covered earlier >>>> (this better reflects the current empirically best scoring >>>> algorithms). >>>> However, removing an idf is a completely separate issue. The >>>> improved >>>> normalization is independent of whether or not that change is >>>> made. >>>> >>>> For the central question of what the top score should be, >>>> upon reflection, I don't like the definition below. It >>>> defined the top score >>>> as (approximately) the percentage of query terms matched by >>>> the top scoring result. 
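[Editor's note: the ratio-preserving rescaling described above can be sketched in a few lines. This is a hypothetical illustration, not Lucene code: a single factor is applied to all raw scores so that the top hit lands at the chosen target value, leaving every score ratio unchanged.]

```python
def renormalize(raw_scores, top_target):
    """Rescale scores so the best hit gets top_target (e.g. the fraction
    of query terms it matched) while preserving the ratio of every
    score to the best score."""
    best = max(raw_scores)
    factor = top_target / best      # the single normalizer applied to all hits
    return [s * factor for s in raw_scores]

# Raw scores whose top hit matched only half of the query terms:
print(renormalize([4.0, 2.0, 1.0], 0.5))  # -> [0.5, 0.25, 0.125]
```

The whole proposal then reduces, as stated above, to choosing `top_target` for the best match.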
A better conceptual definition is to >>>> use a weighted average based on the boosts. I.e., downward >>>> propagate all boosts to the >>>> underlying terms (or phrases). Specifically, the "net boost" >>>> of a term >>>> is the product of the direct boost of the term and all boosts >>>> applied to >>>> encompassing clauses. Then the score of the top result >>>> becomes the sum >>>> of the net boosts of its matching terms divided by the sum of >>>> the net boosts of all query terms. >>>> >>>> This definition is a refinement of the original proposal >>>> below, and it >>>> could probably be further refined if some impact of the tf, >>>> idf and/or >>>> lengthNorm was desired in determining the top score. These >>>> other factors seem to be harder to normalize for, although >>>> I've thought of some simple approaches; e.g., assume the >>>> unmatched terms in the top result have values for these three >>>> factors that are the average of the >>>> values of the matched terms, then apply exactly the same >>>> concept of actual score divided by theoretical maximum >>>> score. That would eliminate any need to maintain maximum >>>> value statistics in the index. >>>> >>>> As an example of the simple boost-based normalization, for >>>> the query ((a^2 b)^3 (c d^2)) the net boosts are: >>>> a --> 6, b --> 3, c --> 1, d --> 2 >>>> >>>> So if a and b matched, but not c and d, in the top scoring >>>> result, its >>>> score would be 0.75. The normalizer would be 0.75/(current >>>> score except >>>> for the current normalization). This normalizer would be >>>> applied to all >>>> current scores (minus normalization) to create the normalized >>>> scores. >>>> >>>> For simple query (a b), if only one of the terms matched in >>>> the top result, then its score would be 0.5, vs. 1.0 or many >>>> other possible scores today. 
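[Editor's note: the worked example above can be checked with a small sketch. This is hypothetical illustration code, not part of Lucene: a query tree is flattened into per-term net boosts by multiplying each term's own boost by the boosts of all enclosing clauses, and the top score is the matched fraction of the total net boost.]

```python
def net_boosts(clause, outer=1.0, acc=None):
    """Flatten a query tree into {term: net boost}. A clause is either
    ("term", name, boost) or ("bool", [subclauses], boost); a term's net
    boost is its own boost times the boosts of all enclosing clauses."""
    if acc is None:
        acc = {}
    if clause[0] == "term":
        _, name, boost = clause
        acc[name] = outer * boost
    else:
        _, subs, boost = clause
        for sub in subs:
            net_boosts(sub, outer * boost, acc)
    return acc

# ((a^2 b)^3 (c d^2)) from the example above:
query = ("bool", [
    ("bool", [("term", "a", 2), ("term", "b", 1)], 3),
    ("bool", [("term", "c", 1), ("term", "d", 2)], 1),
], 1)

boosts = net_boosts(query)               # {'a': 6, 'b': 3, 'c': 1, 'd': 2}
matched = {"a", "b"}                     # top hit matched a and b only
top = sum(boosts[t] for t in matched) / sum(boosts.values())
print(boosts, top)                       # top score = 9/12 = 0.75
```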
>>>> >>>> In addition to enabling more "intelligent" UI's that >>>> communicate the quality of results to end-users, the proposal >>>> below also extends the explain() mechanism to fully explain >>>> the final normalized score. However, that change is also >>>> independent -- it could be done with the current scoring. >>>> >>>> Am I the only one who would like to see better normalization >>>> in Lucene? Does anybody have a better approach? >>>> >>>> If you've read this far, thanks for indulging me on this. I >>>> would love >>>> to see this or something with similar properties in Lucene. >>>> I really just want to build my app, but as stated below would >>>> write and contribute this if there is interest in putting it >>>> in, and nobody else >>>> wants to write it. Please let me know what you think one way >>>> or the other. >>>> >>>> Thanks, >>>> >>>> Chuck >>>> >>>> >>>>> -----Original Message----- >>>>> From: Chuck Williams >>>>> Sent: Monday, October 18, 2004 7:04 PM >>>>> To: 'Lucene Developers List' >>>>> Subject: RE: idf and explain(), was Re: Search and Scoring >>>>> >>>>> Doug Cutting wrote: >>>>>> If this is a big issue for you, as it seems it is, please >>>>>> >>>> submit >>>> a >>>>> patch >>>>>> to optionally disable score normalization in Hits.java. >>>>>> >>>>> and: >>>>>> The quantity 'sum(t) weight(t,d)^2' must be recomputed for >>>>>> >>>> each >>>>> document >>>>>> each time a document is added to the collection, since >>>>>> >>>> 'weight(t,d)' >>>>> is >>>>>> dependent on global term statistics. This is prohibitively >>>>>> >>>> expensive. >>>>>> Research has also demonstrated that such cosine >>>>>> normalization >>>>>> >>>> gives >>>>>> somewhat inferior results (e.g., Singhal's pivoted length >>>>>> >>>>> normalization). >>>>> >>>>> I'm willing to write, test and contribute code to address >>>>> the normalization issue, i.e. to yield scores in [0, 1] >>>>> that are >>>>> >>>> meaningful >>>>> across searches. 
Unfortunately, this is considerably more >>>>> >>>> involved >>>> than >>>>> just optionally eliminating the current normalization in >>>>> Hits. >>>>> >>>> Before >>>>> undertaking this, I'd like to see if there is agreement in >>>>> >>>> principle >>>>> that this is a good idea, and that my specific proposal >>>>> below is >>>>> >>>> the >>>>> right way to go. Also, I'd like to make sure I've correctly >>>>> >>>> inferred >>>>> the constraints on writing code to be incorporated into >>>>> Lucene. >>>>> >>>>> After looking at this in more detail I agree that the >>>>> cosine normalization is not the way to go, because of both >>>>> efficiency >>>>> >>>> and >>>>> effectiveness considerations. A conceptual approach that >>>>> would >>>>> >>>> be >>>>> efficient, relatively easy to implement, and seems to have >>>>> at >>>>> >>>> least >>>>> reasonable behavior would be to define the top scoring >>>>> match to >>>>> >>>> have >>>> a >>>>> score roughly equal to the percentage of query terms it >>>>> matches >>>>> >>>> (its >>>>> "netCoord"). Scores below the top hit would be reduced >>>>> based on >>>>> >>>> the >>>>> ratio of their raw scores to the raw score of the top hit >>>>> >>>> (considering >>>>> all of the current score factors, except that I'd like to >>>>> remove >>>>> >>>> one >>>> of >>>>> the idf factors, as discussed earlier). >>>>> >>>>> For a couple simple cases: >>>>> a) the top match for a single term query would always have a >>>>> >>>> score >>>> of >>>>> 1.0, >>>>> b) the top scoring match for a BooleanQuery using >>>>> >>>> DefaultSimilarity >>>>> with all non-prohibited TermQuery clauses would have a >>>>> score of >>>>> >>>> m/n, >>>>> where the hit matches m of the n terms. >>>>> >>>>> This isn't optimal, but seems much better than the current >>>>> >>>> situation. >>>>> Consider two single-term queries, s and t. 
If s matches >>>>> more >>>>> >>>> strongly >>>>> than t in its top hit (e.g., a higher tf in a shorter >>>>> field), it >>>>> >>>> would >>>>> be best if the top score of s was greater than the top >>>>> score of >>>>> >>>> t. >>>>> But this kind of normalization would require keeping >>>>> additional statistics that so far as I know are not >>>>> currently in the index, >>>>> >>>> like >>>>> the maximum tf for terms and the minimum length for fields. >>>>> >>>> These >>>> could >>>>> be expensive to update on deletes. Also, normalizing by >>>>> such >>>>> >>>> factors >>>>> could yield lower than subjectively reasonable scores in >>>>> most >>>>> >>>> cases, >>>> so >>>>> it's not clear it would be better. >>>>> >>>>> The semantics above are at least clean, easy to understand, >>>>> and >>>>> >>>> support >>>>> what seems to me is the most important motivation to do >>>>> this: >>>>> >>>> allowing >>>>> an application to use simple thresholding to segregate >>>>> >>>> likely-to-be- >>>>> relevant hits from likely-to-be-irrelevant hits. >>>>> >>>>> More specifically, for a BooleanQuery of TermQuery's the >>>>> >>>> resulting >>>> score >>>>> functions would be: >>>>> >>>>> BooleanQuery of TermQuery's sbq = (tq1 ... tqn) >>>>> >>>>> baseScore(sbq, doc) = >>>>> sum(tqi) boost(tqi)*idf(tqi.term)*tf(tqi.term, doc)* >>>>> lengthNorm(tqi.term.field, doc) >>>>> >>>>> rawScore(sbq, doc) = coord(sbq, doc) * baseScore >>>>> >>>>> norm(sbq, hits) = 1 / max(hit in hits) baseScore(sbq, hit) >>>>> >>>>> score(sbq, doc) = rawScore * norm >>>>> >>>>> rawScore's can be computed in the Scorer.score() methods and >>>>> >>>> therefore >>>>> used to sort the hits (e.g., in the instance method for >>>>> collect() >>>>> >>>> in >>>> the >>>>> HitCollector in IndexSearcher.search()). If the top >>>>> scoring hit >>>>> >>>> does >>>>> not have the highest baseScore, then its score could be >>>>> less than >>>>> >>>> its >>>>> coord; this seems desirable. 
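[Editor's note: for concreteness, the baseScore/rawScore/norm/score functions above might be sketched like this. This is illustrative Python with made-up tf/idf/lengthNorm inputs, not real Lucene statistics or classes.]

```python
def base_score(terms, doc):
    """sum over matching terms of boost * idf * tf * lengthNorm.
    terms: list of {"name", "boost", "idf"}; doc maps term -> (tf, lengthNorm)."""
    total = 0.0
    for t in terms:
        if t["name"] in doc:
            tf, length_norm = doc[t["name"]]
            total += t["boost"] * t["idf"] * tf * length_norm
    return total

def raw_score(terms, doc):
    """rawScore = coord * baseScore, coord = fraction of query terms matched."""
    coord = sum(t["name"] in doc for t in terms) / len(terms)
    return coord * base_score(terms, doc)

def final_scores(terms, docs):
    """score = rawScore * norm, where norm = 1 / (max baseScore over the hits)."""
    norm = 1.0 / max(base_score(terms, d) for d in docs)
    return [raw_score(terms, d) * norm for d in docs]
```

With two equally weighted terms, a hit matching both scores 1.0 and a hit matching one term with half the baseScore comes out at 0.25: its coord (0.5) further reduced by its baseScore ratio, exactly the property described above.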
These formulas imply that no >>>>> result >>>>> >>>> score >>>>> can be larger than its coord, so if coord is well-defined >>>>> (always between 0 and 1) then all results will be >>>>> normalized between 0 >>>>> >>>> and >>>> 1. >>>> >>>>> In general, the netCoord, which takes the place of coord in >>>>> the >>>>> >>>> simple >>>>> case above, needs to be defined for any query. >>>>> Conceptually, >>>>> >>>> this >>>>> should be the total percentage of query terms matched by the >>>>> >>>> document. >>>>> It must be recursively computable from the subquery >>>>> structure and overridable for specific Query types (e.g., >>>>> to support >>>>> >>>> specialized >>>>> coords, like one that is always 1.0 as is useful in multi- >>>>> field searching). Suitable default definitions for >>>>> TermQuery and >>>>> >>>> BooleanQuery >>>>> are: >>>>> >>>>> TermQuery.netCoord = 1.0 if term matches, 0.0 otherwise >>>>> >>>>> BooleanQuery(c1 ... cn).netCoord = sum(ci) coord(1, n) * >>>>> >>>> ci.netCoord >>>> >>>>> This is not quite percentage of terms matched; e.g., >>>>> consider a BooleanQuery with two clauses, one of which is a >>>>> BooleanQuery of >>>>> >>>> 3 >>>> terms >>>>> and the other which is just a term. However, it doesn't >>>>> seem to >>>>> >>>> be >>>>> unreasonable for a BooleanQuery to state that its clauses >>>>> are >>>>> >>>> equally >>>>> important, and this is consistent with the current coord >>>>> >>>> behavior. >>>>> BooleanQuery.netCoord could be overridden for special >>>>> cases, like >>>>> >>>> the >>>>> pure disjunction I use in my app for field expansions: >>>>> MaxDisjunctionQuery(c1 .. cn).netCoord = max(ci) ci.netCoord >>>>> >>>>> Implementing this would proceed along these lines: 1. For >>>>> backwards compatibility, add some kind of newScoring >>>>> >>>> boolean >>>>> setting. >>>>> 2. Update all of these places to behave as indicated if >>>>> >>>> newScoring: >>>>> a. 
Change Query.weight() to not do any normalization (no >>>>> >>>> call >>>> to >>>>> sumOfSquaredWeights(), Similarity.queryNorm() or >>>>> normalize()). b. Update all Query.weight classes to set >>>>> their value >>>>> >>>> according >>>> to >>>>> the terms in the score formula above that don't involve the >>>>> >>>> document >>>>> (e.g., boost*idf for TermQuery). >>>>> c. Add suitable netCoord definitions to all Scorer >>>>> classes. d. Update all Scorer.score() methods to compute >>>>> the rawScore >>>>> >>>> as >>>>> above. >>>>> e. Add the maximum baseScore as a field kept on TopDocs, >>>>> >>>> computed >>>>> in the HitCollector's. >>>>> f. Change the normalization in Hits to always divide every >>>>> >>>> raw >>>>> score by the maximum baseScore. >>>>> g. Update all of the current explain() methods to be >>>>> >>>> consistent >>>>> with this scoring, and to either report the rawScore, or to >>>>> >>>> report >>>> the >>>>> final score if the normalization factor is provided. h. >>>>> Add Hits.explain() (or better extend Searcher so that it >>>>> >>>> keeps >>>>> the Hits for use in Searcher.explain()) to call the new >>>>> explain variation with the normalization factor so that >>>>> final scores are >>>>> >>>> fully >>>>> explained. >>>>> >>>>> If this seems like a good idea, please let me know. I'm >>>>> sure >>>>> >>>> there >>>> are >>>>> details I've missed that would come out during coding and >>>>> >>>> testing. >>>> Also, >>>>> the value of this is dependent on how reasonable the final >>>>> scores >>>>> >>>> look, >>>>> which is hard to tell for sure until it is working. >>>>> >>>>> The coding standards for Lucene seem reasonably clear from >>>>> the >>>>> >>>> source >>>>> code I've read. I could use just simple Java so there >>>>> shouldn't >>>>> >>>> be >>>> any >>>>> significant JVM dependencies. The above should be fully >>>>> backward compatible due to the newScoring flag. 
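[Editor's note: the recursive netCoord definitions above can be sketched directly. This is hypothetical illustration code; the tuple-based query trees do not correspond to real Lucene classes.]

```python
def net_coord(query, matched):
    """Recursive netCoord: a term contributes 1.0 if matched, else 0.0;
    a Boolean clause averages its children's netCoords (each clause equally
    weighted, matching the current coord behavior); a pure disjunction
    (MaxDisjunctionQuery-style override) takes the max instead."""
    kind = query[0]
    if kind == "term":
        return 1.0 if query[1] in matched else 0.0
    subs = query[1]
    child = [net_coord(s, matched) for s in subs]
    if kind == "max":
        return max(child)
    return sum(child) / len(subs)

# The example above: a BooleanQuery with a 3-term sub-BooleanQuery and a
# lone term. Matching a and d gives (1/3 + 1) / 2 = 2/3, which is not the
# plain percentage of terms matched (2/4 = 0.5).
q = ("bool", [("bool", [("term", "a"), ("term", "b"), ("term", "c")]),
              ("term", "d")])
print(net_coord(q, {"a", "d"}))
```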
>>>>> >>>>> On another note, I had to remove the German analyzer in my >>>>> >>>> current >>>> 1.4.2 >>>>> source configuration because GermanStemmer failed to >>>>> compile due >>>>> >>>> to >>>> what >>>>> are apparently Unicode character constants that I've now >>>>> got as >>>>> >>>> illegal >>>>> two-character character constants. This is presumably an >>>>> >>>> encoding >>>>> problem somewhere that I could track down. It's not >>>>> important, >>>>> >>>> but >>>> if >>>>> the answer is obvious to any of you, I'd appreciate the >>>>> quick >>>>> >>>> tip. >>>> >>>>> Thanks, >>>>> >>>>> Chuck >>>>> >>>>>> -----Original Message----- >>>>>> From: Doug Cutting [mailto:[EMAIL PROTECTED] Sent: >>>>>> Monday, October 18, 2004 9:44 AM To: Lucene Developers >>>>>> List Subject: Re: idf and explain(), was Re: Search and >>>>>> Scoring >>>>>> >>>>>> Chuck Williams wrote: >>>>>>> That's a good point on how the standard vector space >>>>>>> inner >>>>>>> >>>> product >>>>>>> similarity measure does imply that the idf is squared >>>>>>> >>>> relative >>>> to >>>>> the >>>>>>> document tf. Even having been aware of this formula >>>>>>> for a >>>>>>> >>>> long >>>>> time, >>>>>>> this particular implication never occurred to me. Do >>>>>>> you >>>>>>> >>>> know >>>> if >>>>>>> anybody has done precision/recall or other relevancy >>>>>>> >>>> empirical >>>>>>> measurements comparing this vs. a model that does not >>>>>>> >>>> square >>>> idf? >>>> >>>>>> No, not that I know of. >>>>>> >>>>>>> Regarding normalization, the normalization in Hits does >>>>>>> not >>>>>>> >>>> have >>>>> very >>>>>>> nice properties. Due to the > 1.0 threshold check, it >>>>>>> >>>> loses >>>>>>> information, and it arbitrarily defines the highest >>>>>>> scoring >>>>>>> >>>> result >>>>> in >>>>>>> any list that generates scores above 1.0 as a perfect >>>>>>> >>>> match. 
It >>>>> would >>>>>>> be nice if score values were meaningful independent of >>>>>>> >>>> searches, >>>>> e.g., >>>>>>> if "0.6" meant the same quality of retrieval >>>>>>> independent of >>>>>>> >>>> what >>>>>> search >>>>>>> was done. This would allow, for example, sites to use >>>>>>> a >>>>>>> >>>> simple >>>>>>> quality threshold to only show results that were "good >>>>>>> >>>> enough". >>>>> At my >>>>>>> last company (I was President and head of engineering >>>>>>> for >>>>>>> >>>> InQuira), >>>>> we >>>>>>> found this to be important to many customers. >>>>>>> >>>>>> If this is a big issue for you, as it seems it is, please >>>>>> >>>> submit >>>> a >>>>> patch >>>>>> to optionally disable score normalization in Hits.java. >>>>>> >>>>>>> The standard vector space similarity measure includes >>>>>>> >>>>> normalization by >>>>>>> the product of the norms of the vectors, i.e.: >>>>>>> >>>>>>> score(d,q) = sum over t of ( weight(t,q) * weight(t,d) >>>>>>> ) >>>>>>> >>>> / >>>>>>> sqrt [ (sum(t) weight(t,q)^2) * (sum(t) >>>>>>> >>>>>> weight(t,d)^2) ] >>>>>> >>>>>>> This makes the score a cosine, which since the values >>>>>>> are >>>>>>> >>>> all >>>>> positive, >>>>>>> forces it to be in [0, 1]. The sumOfSquares() >>>>>>> >>>> normalization >>>> in >>>>> Lucene >>>>>>> does not fully implement this. Is there a specific >>>>>>> reason >>>>>>> >>>> for >>>>> that? >>>>> >>>>>> The quantity 'sum(t) weight(t,d)^2' must be recomputed for >>>>>> >>>> each >>>>> document >>>>>> each time a document is added to the collection, since >>>>>> >>>> 'weight(t,d)' >>>>> is >>>>>> dependent on global term statistics. This is prohibitively >>>>>> >>>> expensive. >>>>>> Research has also demonstrated that such cosine >>>>>> normalization >>>>>> >>>> gives >>>>>> somewhat inferior results (e.g., Singhal's pivoted length >>>>>> >>>>> normalization). >>>>> >>>>>>> Re. 
explain(), I don't see a downside to extending it to >>>>>>> show >>>>>>> >>>> the >>>>> final >>>>>>> normalization in Hits. It could still show the raw >>>>>>> score >>>>>>> >>>> just >>>>> prior >>>>>> to >>>>>>> that normalization. >>>>>>> >>>>>> In order to normalize scores to 1.0 one must know the >>>>>> maximum >>>>>> >>>> score. >>>>>> Explain only computes the score for a single document, and >>>>>> >>>> the >>>>> maximum >>>>>> score is not known. >>>>>> >>>>>>> Although I think it would be best to have a >>>>>>> normalization that would render scores comparable across >>>>>>> >>>> searches. >>>> >>>>>> Then please submit a patch. Lucene doesn't change on its >>>>>> >>>> own. >>>> >>>>>> Doug

--------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]