Hi Otis, I was re-reading this whole theoretical thread about idf, scoring, normalization, etc. from last October and couldn't help laughing out loud when I read your post, because it summed up what I was thinking the whole time. I think it's really great to have people like Chuck and Paul (Elschot) around. I'm learning so much.
k

On Thu, 21 Oct 2004 10:05:51 -0700 (PDT), Otis Gospodnetic wrote:
> Hi Chuck,
>
> The relative lack of responses is not because there is no interest,
> but probably because there are only a few people on lucene-dev who
> can follow/understand every detail of your proposal. I understand
> and hear you, but I have a hard time 'visualizing' some of the
> formulas in your proposal. What you are saying sounds right to me,
> but I don't have enough theoretical knowledge to go one way or the
> other.
>
> Otis
>
> --- Chuck Williams <[EMAIL PROTECTED]> wrote:
>
>> Hi everybody,
>>
>> Although there doesn't seem to be much interest in this, I have one
>> significant improvement to the proposal below and a couple of
>> observations that clarify the situation.
>>
>> To illustrate the problem that better normalization is intended to
>> address: in my current application, for BooleanQuery's of two terms
>> I frequently get a top score of 1.0 when only one of the terms is
>> matched. 1.0 should indicate a "perfect match". I'd like to set my
>> UI up to present the results differently depending on how good the
>> different results are (e.g., showing a visual indication of result
>> quality, dropping the really bad results entirely, and segregating
>> the good results from other, only vaguely relevant results). To
>> build this kind of "intelligence" into the UI, I certainly need to
>> know whether my top result matched all query terms, or only half of
>> them. I'd like to have the score tell me directly how good the
>> matches are. The current normalization does not achieve this.
>>
>> The intent of a new normalization scheme is to preserve the current
>> scoring in the following sense: the ratio of the nth result's score
>> to the best result's score remains the same. Therefore, the only
>> question is what normalization factor to apply to all scores. This
>> reduces to a very specific question that determines the entire
>> normalization -- what should the score of the best matching result
>> be?
>>
>> The mechanism below has this property, i.e. it keeps the current
>> score ratios, except that I removed one idf term for reasons
>> covered earlier (this better reflects the current empirically best
>> scoring algorithms). However, removing an idf is a completely
>> separate issue. The improved normalization is independent of
>> whether or not that change is made.
>>
>> For the central question of what the top score should be, upon
>> reflection, I don't like the definition below. It defined the top
>> score as (approximately) the percentage of query terms matched by
>> the top scoring result. A better conceptual definition is to use a
>> weighted average based on the boosts, i.e., propagate all boosts
>> down to the underlying terms (or phrases). Specifically, the "net
>> boost" of a term is the product of the direct boost of the term and
>> all boosts applied to encompassing clauses. Then the score of the
>> top result becomes the sum of the net boosts of its matching terms
>> divided by the sum of the net boosts of all query terms.
>>
>> This definition is a refinement of the original proposal below, and
>> it could probably be further refined if some impact of the tf, idf
>> and/or lengthNorm was desired in determining the top score. These
>> other factors seem to be harder to normalize for, although I've
>> thought of some simple approaches; e.g., assume the unmatched terms
>> in the top result have values for these three factors that are the
>> average of the values of the matched terms, then apply exactly the
>> same concept of actual score divided by theoretical maximum score.
>> That would eliminate any need to maintain maximum-value statistics
>> in the index.
>>
>> As an example of the simple boost-based normalization, for the
>> query ((a^2 b)^3 (c d^2)) the net boosts are:
>>
>>   a --> 6
>>   b --> 3
>>   c --> 1
>>   d --> 2
>>
>> So if a and b matched, but not c and d, in the top scoring result,
>> its score would be (6+3)/(6+3+1+2) = 0.75. The normalizer would be
>> 0.75/(current score except for the current normalization). This
>> normalizer would be applied to all current scores (minus
>> normalization) to create the normalized scores.
>>
>> For the simple query (a b), if only one of the terms matched in the
>> top result, then its score would be 0.5, vs. 1.0 or many other
>> possible scores today.
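To make the net-boost propagation concrete, here is a rough, self-contained Java sketch of the walk Chuck describes. The Clause/Term/Bool classes are hypothetical stand-ins for a query tree, not the actual Lucene API; the sketch just reproduces the arithmetic of the example above.

    import java.util.LinkedHashMap;
    import java.util.Map;

    /** Hypothetical minimal query tree: a clause is either a term or a
     *  list of sub-clauses, and every clause carries a boost. */
    abstract class Clause {
        final float boost;
        Clause(float boost) { this.boost = boost; }
    }

    class Term extends Clause {
        final String text;
        Term(String text, float boost) { super(boost); this.text = text; }
    }

    class Bool extends Clause {
        final Clause[] subs;
        Bool(float boost, Clause... subs) { super(boost); this.subs = subs; }
    }

    public class NetBoosts {
        /** The "net boost" of a term is its own boost times the boosts
         *  of all enclosing clauses, collected by walking the tree. */
        static void collect(Clause c, float enclosing, Map<String, Float> out) {
            float net = enclosing * c.boost;
            if (c instanceof Term) {
                out.put(((Term) c).text, net);
            } else {
                for (Clause sub : ((Bool) c).subs) collect(sub, net, out);
            }
        }

        public static void main(String[] args) {
            // ((a^2 b)^3 (c d^2)) from the example above
            Clause query = new Bool(1f,
                new Bool(3f, new Term("a", 2f), new Term("b", 1f)),
                new Bool(1f, new Term("c", 1f), new Term("d", 2f)));
            Map<String, Float> net = new LinkedHashMap<>();
            collect(query, 1f, net);
            System.out.println(net);  // {a=6.0, b=3.0, c=1.0, d=2.0}

            // Top score if only a and b match: sum of the matched net
            // boosts over the sum of all net boosts = (6+3)/(6+3+1+2)
            float matched = net.get("a") + net.get("b");
            float total = 0f;
            for (float v : net.values()) total += v;
            System.out.println(matched / total);  // 0.75
        }
    }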
>> In addition to enabling more "intelligent" UIs that communicate the
>> quality of results to end-users, the proposal below also extends
>> the explain() mechanism to fully explain the final normalized
>> score. However, that change is also independent -- it could be done
>> with the current scoring.
>>
>> Am I the only one who would like to see better normalization in
>> Lucene? Does anybody have a better approach?
>>
>> If you've read this far, thanks for indulging me on this. I would
>> love to see this or something with similar properties in Lucene. I
>> really just want to build my app, but as stated below I would write
>> and contribute this if there is interest in putting it in and
>> nobody else wants to write it. Please let me know what you think,
>> one way or the other.
>>
>> Thanks,
>>
>> Chuck
>>
>>> -----Original Message-----
>>> From: Chuck Williams
>>> Sent: Monday, October 18, 2004 7:04 PM
>>> To: 'Lucene Developers List'
>>> Subject: RE: idf and explain(), was Re: Search and Scoring
>>>
>>> Doug Cutting wrote:
>>>> If this is a big issue for you, as it seems it is, please submit
>>>> a patch to optionally disable score normalization in Hits.java.
>>>
>>> and:
>>>> The quantity 'sum(t) weight(t,d)^2' must be recomputed for each
>>>> document each time a document is added to the collection, since
>>>> 'weight(t,d)' is dependent on global term statistics. This is
>>>> prohibitively expensive. Research has also demonstrated that such
>>>> cosine normalization gives somewhat inferior results (e.g.,
>>>> Singhal's pivoted length normalization).
>>>
>>> I'm willing to write, test and contribute code to address the
>>> normalization issue, i.e. to yield scores in [0, 1] that are
>>> meaningful across searches. Unfortunately, this is considerably
>>> more involved than just optionally eliminating the current
>>> normalization in Hits. Before undertaking this, I'd like to see if
>>> there is agreement in principle that this is a good idea, and that
>>> my specific proposal below is the right way to go. Also, I'd like
>>> to make sure I've correctly inferred the constraints on writing
>>> code to be incorporated into Lucene.
>>>
>>> After looking at this in more detail, I agree that the cosine
>>> normalization is not the way to go, because of both efficiency and
>>> effectiveness considerations. A conceptual approach that would be
>>> efficient, relatively easy to implement, and seems to have at
>>> least reasonable behavior would be to define the top scoring match
>>> to have a score roughly equal to the percentage of query terms it
>>> matches (its "netCoord"). Scores below the top hit would be
>>> reduced based on the ratio of their raw scores to the raw score of
>>> the top hit (considering all of the current score factors, except
>>> that I'd like to remove one of the idf factors, as discussed
>>> earlier).
>>>
>>> For a couple of simple cases:
>>> a) the top match for a single-term query would always have a score
>>>    of 1.0;
>>> b) the top scoring match for a BooleanQuery using
>>>    DefaultSimilarity with all non-prohibited TermQuery clauses
>>>    would have a score of m/n, where the hit matches m of the n
>>>    terms.
>>>
>>> This isn't optimal, but seems much better than the current
>>> situation. Consider two single-term queries, s and t. If s matches
>>> more strongly than t in its top hit (e.g., a higher tf in a
>>> shorter field), it would be best if the top score of s were
>>> greater than the top score of t. But this kind of normalization
>>> would require keeping additional statistics that, so far as I
>>> know, are not currently in the index, like the maximum tf for
>>> terms and the minimum length for fields. These could be expensive
>>> to update on deletes. Also, normalizing by such factors could
>>> yield lower than subjectively reasonable scores in most cases, so
>>> it's not clear it would be better.
>>>
>>> The semantics above are at least clean, easy to understand, and
>>> support what seems to me the most important motivation to do this:
>>> allowing an application to use simple thresholding to segregate
>>> likely-to-be-relevant hits from likely-to-be-irrelevant hits.
>>>
>>> More specifically, for a BooleanQuery of TermQuery's, the
>>> resulting score functions would be:
>>>
>>> BooleanQuery of TermQuery's sbq = (tq1 ... tqn)
>>>
>>> baseScore(sbq, doc) =
>>>     sum(tqi) boost(tqi) * idf(tqi.term) * tf(tqi.term, doc) *
>>>              lengthNorm(tqi.term.field, doc)
>>>
>>> rawScore(sbq, doc) = coord(sbq, doc) * baseScore(sbq, doc)
>>>
>>> norm(sbq, hits) = 1 / max(hit in hits) baseScore(sbq, hit)
>>>
>>> score(sbq, doc) = rawScore(sbq, doc) * norm(sbq, hits)
>>>
>>> rawScores can be computed in the Scorer.score() methods and
>>> therefore used to sort the hits (e.g., in the instance method for
>>> collect() in the HitCollector in IndexSearcher.search()). If the
>>> top scoring hit does not have the highest baseScore, then its
>>> score could be less than its coord; this seems desirable. These
>>> formulas imply that no result score can be larger than its coord,
>>> so if coord is well-defined (always between 0 and 1) then all
>>> results will be normalized between 0 and 1.
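A compact sketch of this scoring pipeline, assuming per-term weights (boost * idf) and per-document tf * lengthNorm values are supplied as plain arrays rather than read from an index; illustrative only, not Lucene's actual Scorer/HitCollector code.

    /** Illustrative only: scores documents for a pure-disjunction term
     *  query using baseScore, rawScore = coord * baseScore, and a final
     *  normalization by the maximum baseScore, per the formulas above. */
    public class NormalizedScoring {
        public static void main(String[] args) {
            // Two query terms; per-term weight = boost * idf
            // (document-independent).
            float[] termWeight = { 1.0f * 2.0f, 1.0f * 0.5f };

            // tf * lengthNorm per document and term (doc 0 matches both
            // terms, doc 1 matches only the first).
            float[][] tfNorm = { { 1.4f, 0.9f }, { 2.0f, 0.0f } };

            int nDocs = tfNorm.length, nTerms = termWeight.length;
            float[] rawScore = new float[nDocs];
            float maxBase = 0f;
            for (int d = 0; d < nDocs; d++) {
                float base = 0f;
                int matched = 0;
                for (int t = 0; t < nTerms; t++) {
                    if (tfNorm[d][t] > 0f) matched++;
                    base += termWeight[t] * tfNorm[d][t];
                }
                float coord = (float) matched / nTerms; // fraction of terms matched
                rawScore[d] = coord * base;
                maxBase = Math.max(maxBase, base);      // norm = 1 / max baseScore
            }
            for (int d = 0; d < nDocs; d++) {
                // Final score; no result can exceed its coord, so all
                // scores land in [0, 1].
                System.out.println("doc " + d + ": " + rawScore[d] / maxBase);
            }
        }
    }

With these toy numbers doc 0 scores 0.8125 and doc 1 scores 0.5: the top hit scores below its coord of 1.0 because it does not have the highest baseScore, which is exactly the behavior the message calls desirable.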
>>> In general, the netCoord, which takes the place of coord in the
>>> simple case above, needs to be defined for any query. Conceptually,
>>> this should be the total percentage of query terms matched by the
>>> document. It must be recursively computable from the subquery
>>> structure and overridable for specific Query types (e.g., to
>>> support specialized coords, like one that is always 1.0, as is
>>> useful in multi-field searching). Suitable default definitions for
>>> TermQuery and BooleanQuery are:
>>>
>>> TermQuery.netCoord = 1.0 if the term matches, 0.0 otherwise
>>>
>>> BooleanQuery(c1 ... cn).netCoord = sum(ci) coord(1, n) * ci.netCoord
>>>
>>> This is not quite the percentage of terms matched; e.g., consider a
>>> BooleanQuery with two clauses, one of which is a BooleanQuery of 3
>>> terms and the other of which is just a term. However, it doesn't
>>> seem unreasonable for a BooleanQuery to state that its clauses are
>>> equally important, and this is consistent with the current coord
>>> behavior. BooleanQuery.netCoord could be overridden for special
>>> cases, like the pure disjunction I use in my app for field
>>> expansions:
>>>
>>> MaxDisjunctionQuery(c1 .. cn).netCoord = max(ci) ci.netCoord
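A rough recursive sketch of these default netCoord definitions, using made-up node classes and DefaultSimilarity's coord(m, n) = m/n; not actual Lucene code.

    import java.util.Set;

    /** Hypothetical query nodes, just enough to show the recursion. */
    interface Node { float netCoord(Set<String> matchedTerms); }

    class TermNode implements Node {
        final String term;
        TermNode(String term) { this.term = term; }
        // TermQuery.netCoord = 1.0 if the term matches, 0.0 otherwise
        public float netCoord(Set<String> matched) {
            return matched.contains(term) ? 1f : 0f;
        }
    }

    class BoolNode implements Node {
        final Node[] clauses;
        BoolNode(Node... clauses) { this.clauses = clauses; }
        // BooleanQuery.netCoord = sum(ci) coord(1, n) * ci.netCoord,
        // with coord(1, n) = 1/n as in DefaultSimilarity
        public float netCoord(Set<String> matched) {
            float sum = 0f;
            for (Node c : clauses) sum += c.netCoord(matched);
            return sum / clauses.length;
        }
    }

    class MaxDisjunctionNode implements Node {
        final Node[] clauses;
        MaxDisjunctionNode(Node... clauses) { this.clauses = clauses; }
        // Pure disjunction: netCoord = max(ci) ci.netCoord
        public float netCoord(Set<String> matched) {
            float max = 0f;
            for (Node c : clauses) max = Math.max(max, c.netCoord(matched));
            return max;
        }
    }

    public class NetCoordDemo {
        public static void main(String[] args) {
            // ((a b c) d): only d matches --> netCoord = (0 + 1)/2 = 0.5,
            // showing this is not quite "percentage of terms matched" (1/4).
            Node q = new BoolNode(
                new BoolNode(new TermNode("a"), new TermNode("b"),
                             new TermNode("c")),
                new TermNode("d"));
            System.out.println(q.netCoord(Set.of("d")));  // 0.5
        }
    }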
>>> Implementing this would proceed along these lines:
>>> 1. For backwards compatibility, add some kind of newScoring
>>>    boolean setting.
>>> 2. Update all of these places to behave as indicated if newScoring:
>>>    a. Change Query.weight() to not do any normalization (no call
>>>       to sumOfSquaredWeights(), Similarity.queryNorm() or
>>>       normalize()).
>>>    b. Update all Query.weight classes to set their value according
>>>       to the terms in the score formula above that don't involve
>>>       the document (e.g., boost*idf for TermQuery).
>>>    c. Add suitable netCoord definitions to all Scorer classes.
>>>    d. Update all Scorer.score() methods to compute the rawScore as
>>>       above.
>>>    e. Add the maximum baseScore as a field kept on TopDocs,
>>>       computed in the HitCollector's.
>>>    f. Change the normalization in Hits to always divide every raw
>>>       score by the maximum baseScore.
>>>    g. Update all of the current explain() methods to be consistent
>>>       with this scoring, and to either report the rawScore, or to
>>>       report the final score if the normalization factor is
>>>       provided.
>>>    h. Add Hits.explain() (or better, extend Searcher so that it
>>>       keeps the Hits for use in Searcher.explain()) to call the
>>>       new explain variation with the normalization factor so that
>>>       final scores are fully explained.
>>>
>>> If this seems like a good idea, please let me know. I'm sure there
>>> are details I've missed that would come out during coding and
>>> testing. Also, the value of this depends on how reasonable the
>>> final scores look, which is hard to tell for sure until it is
>>> working.
>>>
>>> The coding standards for Lucene seem reasonably clear from the
>>> source code I've read. I could use just simple Java, so there
>>> shouldn't be any significant JVM dependencies. The above should be
>>> fully backward compatible due to the newScoring flag.
>>>
>>> On another note, I had to remove the German analyzer in my current
>>> 1.4.2 source configuration because GermanStemmer failed to compile
>>> due to what are apparently Unicode character constants that I've
>>> now got as illegal two-character character constants. This is
>>> presumably an encoding problem somewhere that I could track down.
>>> It's not important, but if the answer is obvious to any of you,
>>> I'd appreciate the quick tip.
>>>
>>> Thanks,
>>>
>>> Chuck
>>>
>>>> -----Original Message-----
>>>> From: Doug Cutting [mailto:[EMAIL PROTECTED]]
>>>> Sent: Monday, October 18, 2004 9:44 AM
>>>> To: Lucene Developers List
>>>> Subject: Re: idf and explain(), was Re: Search and Scoring
>>>>
>>>> Chuck Williams wrote:
>>>>> That's a good point on how the standard vector space inner
>>>>> product similarity measure does imply that the idf is squared
>>>>> relative to the document tf. Even having been aware of this
>>>>> formula for a long time, this particular implication never
>>>>> occurred to me. Do you know if anybody has done precision/recall
>>>>> or other empirical relevancy measurements comparing this vs. a
>>>>> model that does not square idf?
>>>>
>>>> No, not that I know of.
>>>>
>>>>> Regarding normalization, the normalization in Hits does not have
>>>>> very nice properties. Due to the > 1.0 threshold check, it loses
>>>>> information, and it arbitrarily defines the highest scoring
>>>>> result in any list that generates scores above 1.0 as a perfect
>>>>> match. It would be nice if score values were meaningful
>>>>> independent of searches, e.g., if "0.6" meant the same quality
>>>>> of retrieval independent of what search was done. This would
>>>>> allow, for example, sites to use a simple quality threshold to
>>>>> only show results that were "good enough". At my last company (I
>>>>> was President and head of engineering for InQuira), we found
>>>>> this to be important to many customers.
>>>>
>>>> If this is a big issue for you, as it seems it is, please submit
>>>> a patch to optionally disable score normalization in Hits.java.
>>>>
>>>>> The standard vector space similarity measure includes
>>>>> normalization by the product of the norms of the vectors, i.e.:
>>>>>
>>>>> score(d,q) = sum(t) ( weight(t,q) * weight(t,d) ) /
>>>>>     sqrt[ (sum(t) weight(t,q)^2) * (sum(t) weight(t,d)^2) ]
>>>>>
>>>>> This makes the score a cosine, which, since the values are all
>>>>> positive, forces it to be in [0, 1]. The sumOfSquares()
>>>>> normalization in Lucene does not fully implement this. Is there
>>>>> a specific reason for that?
>>>>
>>>> The quantity 'sum(t) weight(t,d)^2' must be recomputed for each
>>>> document each time a document is added to the collection, since
>>>> 'weight(t,d)' is dependent on global term statistics. This is
>>>> prohibitively expensive. Research has also demonstrated that such
>>>> cosine normalization gives somewhat inferior results (e.g.,
>>>> Singhal's pivoted length normalization).
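For reference, a small self-contained Java sketch of the cosine measure quoted above, computed over toy weight vectors; it is unrelated to Lucene's actual implementation.

    /** Cosine similarity per the quoted formula: the dot product of the
     *  query and document weight vectors divided by the product of
     *  their Euclidean norms. With non-negative weights the result is
     *  always in [0, 1]. */
    public class CosineDemo {
        static double cosine(double[] q, double[] d) {
            double dot = 0, qq = 0, dd = 0;
            for (int t = 0; t < q.length; t++) {
                dot += q[t] * d[t];   // sum(t) weight(t,q) * weight(t,d)
                qq  += q[t] * q[t];   // sum(t) weight(t,q)^2
                dd  += d[t] * d[t];   // sum(t) weight(t,d)^2
            }
            return dot / Math.sqrt(qq * dd);
        }

        public static void main(String[] args) {
            double[] q = { 1.0, 0.5, 0.0 };   // weight(t,q) per term
            double[] d = { 0.8, 0.0, 0.3 };   // weight(t,d) per term
            System.out.println(cosine(q, d)); // ~0.84
        }
    }

As Doug notes, the document-norm factor is what makes this prohibitively expensive in a live index: it depends on global term statistics and would need recomputing for every document on each addition.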
>>>>> Re. explain(), I don't see a downside to extending it to show
>>>>> the final normalization in Hits. It could still show the raw
>>>>> score just prior to that normalization.
>>>>
>>>> In order to normalize scores to 1.0 one must know the maximum
>>>> score. Explain only computes the score for a single document, and
>>>> the maximum score is not known.
>>>>
>>>>> Although I think it would be best to have a normalization that
>>>>> would render scores comparable across searches.
>>>>
>>>> Then please submit a patch. Lucene doesn't change on its own.
>>>>
>>>> Doug