Exactly. Luckily, since then I've learned a bit from lucene-dev discussions and some side reading on IR, so some of the topics are making more sense now.
Otis

--- Kelvin Tan <[EMAIL PROTECTED]> wrote:

> Hi Otis, I was re-reading this whole theoretical thread about idf, scoring, normalization, etc. from last Oct and couldn't help laughing out loud when I read your post, coz it summed up what I was thinking the whole time. I think it's really great to have people like Chuck and Paul (Elschot) around. I'm learning so much.
>
> k
>
> On Thu, 21 Oct 2004 10:05:51 -0700 (PDT), Otis Gospodnetic wrote:
> > Hi Chuck,
> >
> > The relative lack of responses is not because there is no interest, but probably because there are only a few people on lucene-dev who can follow/understand every detail of your proposal. I understand and hear you, but I have a hard time 'visualizing' some of the formulas in your proposal. What you are saying sounds right to me, but I don't have enough theoretical knowledge to go one way or the other.
> >
> > Otis
> >
> > --- Chuck Williams <[EMAIL PROTECTED]> wrote:
> >
> >> Hi everybody,
> >>
> >> Although there doesn't seem to be much interest in this, I have one significant improvement to the below and a couple of observations that clarify the situation.
> >>
> >> To illustrate the problem better normalization is intended to address: in my current application, for BooleanQuery's of two terms, I frequently get a top score of 1.0 when only one of the terms is matched. 1.0 should indicate a "perfect match". I'd like to set my UI up to present the results differently depending on how good the different results are (e.g., showing a visual indication of result quality, dropping the really bad results entirely, and segregating the good results from other only vaguely relevant results). To build this kind of "intelligence" into the UI, I certainly need to know whether my top result matched all query terms, or only half of them. I'd like to have the score tell me directly how good the matches are. The current normalization does not achieve this.
> >>
> >> The intent of a new normalization scheme is to preserve the current scoring in the following sense: the ratio of the nth result's score to the best result's score remains the same. Therefore, the only question is what normalization factor to apply to all scores. This reduces to a very specific question that determines the entire normalization -- what should the score of the best matching result be?
> >>
> >> The mechanism below has this property, i.e. it keeps the current score ratios, except that I removed one idf term for reasons covered earlier (this better reflects the current empirically best scoring algorithms). However, removing an idf is a completely separate issue; the improved normalization is independent of whether or not that change is made.
> >>
> >> For the central question of what the top score should be, upon reflection, I don't like the definition below. It defined the top score as (approximately) the percentage of query terms matched by the top scoring result. A better conceptual definition is to use a weighted average based on the boosts, i.e., downward-propagate all boosts to the underlying terms (or phrases). Specifically, the "net boost" of a term is the product of the direct boost of the term and all boosts applied to encompassing clauses.
> >> Then the score of the top result becomes the sum of the net boosts of its matching terms divided by the sum of the net boosts of all query terms.
> >>
> >> This definition is a refinement of the original proposal below, and it could probably be further refined if some impact of the tf, idf and/or lengthNorm were desired in determining the top score. These other factors seem to be harder to normalize for, although I've thought of some simple approaches; e.g., assume the unmatched terms in the top result have values for these three factors that are the average of the values of the matched terms, then apply exactly the same concept of actual score divided by theoretical maximum score. That would eliminate any need to maintain maximum-value statistics in the index.
> >>
> >> As an example of the simple boost-based normalization, for the query ((a^2 b)^3 (c d^2)) the net boosts are:
> >>
> >>   a --> 6
> >>   b --> 3
> >>   c --> 1
> >>   d --> 2
> >>
> >> So if a and b matched, but not c and d, in the top scoring result, its score would be 0.75. The normalizer would be 0.75 / (current score except for the current normalization). This normalizer would be applied to all current scores (minus normalization) to create the normalized scores.
> >>
> >> For the simple query (a b), if only one of the terms matched in the top result, then its score would be 0.5, vs. 1.0 or many other possible scores today.
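> >> To make the net-boost arithmetic concrete, here is a rough standalone sketch (the class names are hypothetical, not Lucene API) that propagates boosts down a query tree and computes the proposed top score:
> >>
> >>     import java.util.*;
> >>
> >>     /** A query node: a term leaf or a clause over child nodes, each with a boost. */
> >>     abstract class Node {
> >>         final float boost;
> >>         Node(float boost) { this.boost = boost; }
> >>     }
> >>
> >>     class Term extends Node {
> >>         final String text;
> >>         Term(String text, float boost) { super(boost); this.text = text; }
> >>     }
> >>
> >>     class Clause extends Node {
> >>         final List<Node> children;
> >>         Clause(float boost, Node... children) { super(boost); this.children = Arrays.asList(children); }
> >>     }
> >>
> >>     public class NetBoost {
> >>         // A term's net boost is its own boost times the boosts of all enclosing clauses.
> >>         static void collect(Node n, float enclosing, Map<String, Float> out) {
> >>             float net = enclosing * n.boost;
> >>             if (n instanceof Term) {
> >>                 out.put(((Term) n).text, net);
> >>             } else {
> >>                 for (Node child : ((Clause) n).children) collect(child, net, out);
> >>             }
> >>         }
> >>
> >>         // Top score = sum of net boosts of matched terms / sum of net boosts of all terms.
> >>         static float topScore(Node query, Set<String> matched) {
> >>             Map<String, Float> net = new HashMap<String, Float>();
> >>             collect(query, 1.0f, net);
> >>             float all = 0f, hit = 0f;
> >>             for (Map.Entry<String, Float> e : net.entrySet()) {
> >>                 all += e.getValue();
> >>                 if (matched.contains(e.getKey())) hit += e.getValue();
> >>             }
> >>             return hit / all;
> >>         }
> >>
> >>         public static void main(String[] args) {
> >>             // ((a^2 b)^3 (c d^2)): net boosts are a=6, b=3, c=1, d=2
> >>             Node q = new Clause(1f,
> >>                     new Clause(3f, new Term("a", 2f), new Term("b", 1f)),
> >>                     new Clause(1f, new Term("c", 1f), new Term("d", 2f)));
> >>             Set<String> matched = new HashSet<String>(Arrays.asList("a", "b"));
> >>             System.out.println(topScore(q, matched));   // 0.75 = (6+3)/(6+3+1+2)
> >>         }
> >>     }
> >>
> >> The set of matched terms here just stands in for whatever per-document match information the Scorers would supply; the same downward propagation handles nesting of any depth.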
> >> In addition to enabling more "intelligent" UIs that communicate the quality of results to end-users, the proposal below also extends the explain() mechanism to fully explain the final normalized score. However, that change is also independent -- it could be done with the current scoring.
> >>
> >> Am I the only one who would like to see better normalization in Lucene? Does anybody have a better approach?
> >>
> >> If you've read this far, thanks for indulging me on this. I would love to see this or something with similar properties in Lucene. I really just want to build my app, but as stated below I would write and contribute this if there is interest in putting it in and nobody else wants to write it. Please let me know what you think one way or the other.
> >>
> >> Thanks,
> >>
> >> Chuck
> >>
> >>> -----Original Message-----
> >>> From: Chuck Williams
> >>> Sent: Monday, October 18, 2004 7:04 PM
> >>> To: 'Lucene Developers List'
> >>> Subject: RE: idf and explain(), was Re: Search and Scoring
> >>>
> >>> Doug Cutting wrote:
> >>>> If this is a big issue for you, as it seems it is, please submit a patch to optionally disable score normalization in Hits.java.
> >>> and:
> >>>> The quantity 'sum(t) weight(t,d)^2' must be recomputed for each document each time a document is added to the collection, since 'weight(t,d)' is dependent on global term statistics. This is prohibitively expensive. Research has also demonstrated that such cosine normalization gives somewhat inferior results (e.g., Singhal's pivoted length normalization).
> >>>
> >>> I'm willing to write, test and contribute code to address the normalization issue, i.e. to yield scores in [0, 1] that are meaningful across searches. Unfortunately, this is considerably more involved than just optionally eliminating the current normalization in Hits. Before undertaking this, I'd like to see if there is agreement in principle that this is a good idea, and that my specific proposal below is the right way to go. Also, I'd like to make sure I've correctly inferred the constraints on writing code to be incorporated into Lucene.
> >>>
> >>> After looking at this in more detail, I agree that the cosine normalization is not the way to go, because of both efficiency and effectiveness considerations. A conceptual approach that would be efficient, relatively easy to implement, and seems to have at least reasonable behavior would be to define the top scoring match to have a score roughly equal to the percentage of query terms it matches (its "netCoord"). Scores below the top hit would be reduced based on the ratio of their raw scores to the raw score of the top hit (considering all of the current score factors, except that I'd like to remove one of the idf factors, as discussed earlier).
> >>>
> >>> For a couple of simple cases:
> >>> a) the top match for a single-term query would always have a score of 1.0;
> >>> b) the top scoring match for a BooleanQuery using DefaultSimilarity with all non-prohibited TermQuery clauses would have a score of m/n, where the hit matches m of the n terms.
> >>>
> >>> This isn't optimal, but seems much better than the current situation. Consider two single-term queries, s and t. If s matches more strongly than t in its top hit (e.g., a higher tf in a shorter field), it would be best if the top score of s were greater than the top score of t. But this kind of normalization would require keeping additional statistics that, so far as I know, are not currently in the index, like the maximum tf for terms and the minimum length for fields. These could be expensive to update on deletes. Also, normalizing by such factors could yield lower than subjectively reasonable scores in most cases, so it's not clear it would be better.
> >>>
> >>> The semantics above are at least clean, easy to understand, and support what seems to me the most important motivation to do this: allowing an application to use simple thresholding to segregate likely-to-be-relevant hits from likely-to-be-irrelevant hits.
> >>>
> >>> More specifically, for a BooleanQuery of TermQuery's the resulting score functions would be:
> >>> BooleanQuery of TermQuery's: sbq = (tq1 ... tqn)
> >>>
> >>> baseScore(sbq, doc) = sum(tqi) boost(tqi) * idf(tqi.term) * tf(tqi.term, doc) * lengthNorm(tqi.term.field, doc)
> >>>
> >>> rawScore(sbq, doc) = coord(sbq, doc) * baseScore(sbq, doc)
> >>>
> >>> norm(sbq, hits) = 1 / max(hit in hits) baseScore(sbq, hit)
> >>>
> >>> score(sbq, doc) = rawScore(sbq, doc) * norm(sbq, hits)
> >>>
> >>> rawScores can be computed in the Scorer.score() methods and therefore used to sort the hits (e.g., in the instance method for collect() in the HitCollector in IndexSearcher.search()). If the top scoring hit does not have the highest baseScore, then its score could be less than its coord; this seems desirable. These formulas imply that no result score can be larger than its coord, so if coord is well-defined (always between 0 and 1) then all results will be normalized between 0 and 1.
> >>>
> >>> In general, the netCoord, which takes the place of coord in the simple case above, needs to be defined for any query. Conceptually, this should be the total percentage of query terms matched by the document. It must be recursively computable from the subquery structure and overridable for specific Query types (e.g., to support specialized coords, like one that is always 1.0, as is useful in multi-field searching). Suitable default definitions for TermQuery and BooleanQuery are:
> >>>
> >>> TermQuery.netCoord = 1.0 if the term matches, 0.0 otherwise
> >>>
> >>> BooleanQuery(c1 ... cn).netCoord = sum(ci) coord(1, n) * ci.netCoord
> >>>
> >>> This is not quite the percentage of terms matched; e.g., consider a BooleanQuery with two clauses, one of which is a BooleanQuery of 3 terms and the other of which is just a term. However, it doesn't seem unreasonable for a BooleanQuery to state that its clauses are equally important, and this is consistent with the current coord behavior. BooleanQuery.netCoord could be overridden for special cases, like the pure disjunction I use in my app for field expansions:
> >>>
> >>> MaxDisjunctionQuery(c1 ... cn).netCoord = max(ci) ci.netCoord
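> >>> A rough sketch of the recursion (hypothetical scorer types, not the real Scorer classes; each netCoord() is asked about the current document):
> >>>
> >>>     interface NetCoord {
> >>>         float netCoord();   // fraction in [0, 1] of this query's terms matched
> >>>     }
> >>>
> >>>     class TermNetCoord implements NetCoord {
> >>>         final boolean matches;
> >>>         TermNetCoord(boolean matches) { this.matches = matches; }
> >>>         public float netCoord() { return matches ? 1.0f : 0.0f; }
> >>>     }
> >>>
> >>>     class BooleanNetCoord implements NetCoord {
> >>>         final NetCoord[] clauses;
> >>>         BooleanNetCoord(NetCoord... clauses) { this.clauses = clauses; }
> >>>         public float netCoord() {
> >>>             float sum = 0f;
> >>>             for (NetCoord c : clauses) sum += c.netCoord();
> >>>             return sum / clauses.length;   // coord(1, n) = 1/n: clauses weighted equally
> >>>         }
> >>>     }
> >>>
> >>>     class MaxDisjunctionNetCoord implements NetCoord {
> >>>         final NetCoord[] clauses;
> >>>         MaxDisjunctionNetCoord(NetCoord... clauses) { this.clauses = clauses; }
> >>>         public float netCoord() {
> >>>             float best = 0f;
> >>>             for (NetCoord c : clauses) best = Math.max(best, c.netCoord());
> >>>             return best;   // pure disjunction: the best clause alone determines coverage
> >>>         }
> >>>     }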
> >>> Implementing this would proceed along these lines:
> >>> 1. For backwards compatibility, add some kind of newScoring boolean setting.
> >>> 2. Update all of these places to behave as indicated if newScoring:
> >>>    a. Change Query.weight() to not do any normalization (no call to sumOfSquaredWeights(), Similarity.queryNorm() or normalize()).
> >>>    b. Update all Query.weight classes to set their value according to the terms in the score formula above that don't involve the document (e.g., boost*idf for TermQuery).
> >>>    c. Add suitable netCoord definitions to all Scorer classes.
> >>>    d. Update all Scorer.score() methods to compute the rawScore as above.
> >>>    e. Add the maximum baseScore as a field kept on TopDocs, computed in the HitCollectors (see the sketch after this list).
> >>>    f. Change the normalization in Hits to always divide every raw score by the maximum baseScore.
> >>>    g. Update all of the current explain() methods to be consistent with this scoring, and to either report the rawScore, or to report the final score if the normalization factor is provided.
> >>>    h. Add Hits.explain() (or, better, extend Searcher so that it keeps the Hits for use in Searcher.explain()) to call the new explain variation with the normalization factor so that final scores are fully explained.
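> >>> A rough sketch of steps e and f (hypothetical types; the real change would live in the HitCollector and in Hits):
> >>>
> >>>     class ScoredHit {
> >>>         final int doc;
> >>>         final float rawScore;    // coord * baseScore
> >>>         final float baseScore;   // per-term weight sum, without coord
> >>>         ScoredHit(int doc, float rawScore, float baseScore) {
> >>>             this.doc = doc; this.rawScore = rawScore; this.baseScore = baseScore;
> >>>         }
> >>>     }
> >>>
> >>>     class NormalizingCollector {
> >>>         final java.util.List<ScoredHit> hits = new java.util.ArrayList<ScoredHit>();
> >>>         float maxBaseScore = 0f;
> >>>
> >>>         void collect(ScoredHit hit) {   // called once per matching document
> >>>             hits.add(hit);
> >>>             if (hit.baseScore > maxBaseScore) maxBaseScore = hit.baseScore;
> >>>         }
> >>>
> >>>         // Final score = rawScore / max baseScore. Since rawScore = coord * baseScore
> >>>         // and baseScore <= maxBaseScore, the result never exceeds the hit's coord,
> >>>         // so it stays in [0, 1] whenever coord does.
> >>>         float finalScore(ScoredHit hit) {
> >>>             return maxBaseScore > 0f ? hit.rawScore / maxBaseScore : 0f;
> >>>         }
> >>>     }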
> >>> If this seems like a good idea, please let me know. I'm sure there are details I've missed that would come out during coding and testing. Also, the value of this depends on how reasonable the final scores look, which is hard to tell for sure until it is working.
> >>>
> >>> The coding standards for Lucene seem reasonably clear from the source code I've read. I could use just simple Java, so there shouldn't be any significant JVM dependencies. The above should be fully backward compatible due to the newScoring flag.
> >>>
> >>> On another note, I had to remove the German analyzer in my current 1.4.2 source configuration because GermanStemmer failed to compile due to what are apparently Unicode character constants that I've now got as illegal two-character character constants. This is presumably an encoding problem somewhere that I could track down. It's not important, but if the answer is obvious to any of you, I'd appreciate the quick tip.
> >>>
> >>> Thanks,
> >>>
> >>> Chuck
> >>>
> >>>> -----Original Message-----
> >>>> From: Doug Cutting [mailto:[EMAIL PROTECTED]]
> >>>> Sent: Monday, October 18, 2004 9:44 AM
> >>>> To: Lucene Developers List
> >>>> Subject: Re: idf and explain(), was Re: Search and Scoring
> >>>>
> >>>> Chuck Williams wrote:
> >>>>> That's a good point on how the standard vector space inner product similarity measure does imply that the idf is squared relative to the document tf. Even having been aware of this formula for a long time, this particular implication never occurred to me. Do you know if anybody has done precision/recall or other empirical relevancy measurements comparing this vs. a model that does not square idf?
> >>>>
> >>>> No, not that I know of.
> >>>>
> >>>>> Regarding normalization, the normalization in Hits does not have very nice properties. Due to the > 1.0 threshold check, it loses information, and it arbitrarily defines the highest scoring result in any list that generates scores above 1.0 as a perfect match. It would be nice if score values were meaningful independent of searches, e.g., if "0.6" meant the same quality of retrieval independent of what search was done. This would allow, for example, sites to use a simple quality threshold to only show results that were "good enough". At my last company (I was President and head of engineering for InQuira), we found this to be important to many customers.
> >>>>
> >>>> If this is a big issue for you, as it seems it is, please submit a patch to optionally disable score normalization in Hits.java.
> >>>>
> >>>>> The standard vector space similarity measure includes normalization by the product of the norms of the vectors, i.e.:
> >>>>>
> >>>>> score(d,q) = sum(t) [ weight(t,q) * weight(t,d) ] / sqrt[ (sum(t) weight(t,q)^2) * (sum(t) weight(t,d)^2) ]
> >>>>>
> >>>>> This makes the score a cosine, which, since the values are all positive, forces it to be in [0, 1]. The sumOfSquares() normalization in Lucene does not fully implement this. Is there a specific reason for that?
> >>>>
> >>>> The quantity 'sum(t) weight(t,d)^2' must be recomputed for each document each time a document is added to the collection, since 'weight(t,d)' is dependent on global term statistics. This is prohibitively expensive. Research has also demonstrated that such cosine normalization gives somewhat inferior results (e.g., Singhal's pivoted length normalization).
> >>>>
> >>>>> Re. explain(), I don't see a downside to extending it to show the final normalization in Hits. It could still show the raw score just prior to that normalization.
> >>>>
> >>>> In order to normalize scores to 1.0 one must know the maximum score. explain() only computes the score for a single document, and the maximum score is not known.
> >>>>
> >>>>> Although I think it would be best to have a normalization that would render scores comparable across searches.
> >>>>
> >>>> Then please submit a patch. Lucene doesn't change on its own.
> >>>>
> >>>> Doug
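For anyone else who has a hard time 'visualizing' the formulas: here is the cosine measure quoted in the exchange above, written as a small self-contained function (illustrative only; Lucene does not score this way, for exactly the efficiency reasons Doug gives):

    import java.util.Map;

    public class CosineSketch {
        // score(d,q) = sum(t) weight(t,q)*weight(t,d)
        //              / sqrt( sum(t) weight(t,q)^2 * sum(t) weight(t,d)^2 )
        static double cosineScore(Map<String, Double> q, Map<String, Double> d) {
            double dot = 0, qNorm = 0, dNorm = 0;
            for (Map.Entry<String, Double> e : q.entrySet()) {
                qNorm += e.getValue() * e.getValue();
                Double w = d.get(e.getKey());
                if (w != null) dot += e.getValue() * w;   // term present in both vectors
            }
            for (double w : d.values()) dNorm += w * w;
            // With non-negative weights the result is a cosine in [0, 1].
            return dot / Math.sqrt(qNorm * dNorm);
        }
    }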