Hi Chuck,

The relative lack of responses is not because there is no interest, but probably because only a few people on lucene-dev can follow and understand every detail of your proposal. I understand and hear you, but I have a hard time 'visualizing' some of the formulas in your proposal. What you are saying sounds right to me, but I don't have enough theoretical knowledge to argue one way or the other.
Otis

--- Chuck Williams <[EMAIL PROTECTED]> wrote:

> Hi everybody,
>
> Although there doesn't seem to be much interest in this, I have one significant improvement to the proposal below and a couple of observations that clarify the situation.
>
> To illustrate the problem better normalization is intended to address: in my current application, for BooleanQuery's of two terms I frequently get a top score of 1.0 when only one of the terms is matched. 1.0 should indicate a "perfect match". I'd like to set my UI up to present the results differently depending on how good they are (e.g., showing a visual indication of result quality, dropping the really bad results entirely, and segregating the good results from other only vaguely relevant results). To build this kind of "intelligence" into the UI, I certainly need to know whether my top result matched all query terms, or only half of them. I'd like to have the score tell me directly how good the matches are. The current normalization does not achieve this.
>
> The intent of a new normalization scheme is to preserve the current scoring in the following sense: the ratio of the nth result's score to the best result's score remains the same. Therefore, the only question is what normalization factor to apply to all scores. This reduces to a very specific question that determines the entire normalization -- what should the score of the best matching result be?
>
> The mechanism below has this property, i.e. it keeps the current score ratios, except that I removed one idf term for reasons covered earlier (this better reflects the current empirically best scoring algorithms). However, removing an idf is a completely separate issue. The improved normalization is independent of whether or not that change is made.
>
> For the central question of what the top score should be, upon reflection, I don't like the definition below.
> It defined the top score as (approximately) the percentage of query terms matched by the top scoring result. A better conceptual definition is to use a weighted average based on the boosts, i.e., propagate all boosts downward to the underlying terms (or phrases). Specifically, the "net boost" of a term is the product of the direct boost of the term and all boosts applied to encompassing clauses. Then the score of the top result becomes the sum of the net boosts of its matching terms divided by the sum of the net boosts of all query terms.
>
> This definition is a refinement of the original proposal below, and it could probably be further refined if some impact of the tf, idf and/or lengthNorm were desired in determining the top score. These other factors seem to be harder to normalize for, although I've thought of some simple approaches; e.g., assume the unmatched terms in the top result have values for these three factors that are the average of the values of the matched terms, then apply exactly the same concept of actual score divided by theoretical maximum score. That would eliminate any need to maintain maximum value statistics in the index.
>
> As an example of the simple boost-based normalization, for the query
>
>   ((a^2 b)^3 (c d^2))
>
> the net boosts are:
>
>   a --> 6
>   b --> 3
>   c --> 1
>   d --> 2
>
> So if a and b matched, but not c and d, in the top scoring result, its score would be 0.75. The normalizer would be 0.75 / (current score except for the current normalization). This normalizer would be applied to all current scores (minus normalization) to create the normalized scores.
>
> For the simple query (a b), if only one of the terms matched in the top result, then its score would be 0.5, vs. 1.0 or many other possible scores today.
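The boost-propagation arithmetic above is easy to sketch. The following is a hypothetical illustration (plain Python rather than Lucene's Java, with a made-up tuple representation of the query tree), not actual Lucene code:

```python
# Hypothetical sketch of the boost-based top score and normalizer
# described above; not Lucene code. A query node is (boost, "term") for
# a leaf or (boost, [children]) for a clause; a term's net boost is the
# product of all boosts on the path down to it.

def net_boosts(node, inherited=1.0):
    boost, body = node
    if isinstance(body, str):                  # leaf term
        return {body: inherited * boost}
    result = {}
    for child in body:                         # clause: propagate product
        result.update(net_boosts(child, inherited * boost))
    return result

# ((a^2 b)^3 (c d^2)) from the example above
query = (1.0, [(3.0, [(2.0, "a"), (1.0, "b")]),
               (1.0, [(1.0, "c"), (2.0, "d")])])
boosts = net_boosts(query)       # {'a': 6.0, 'b': 3.0, 'c': 1.0, 'd': 2.0}

def top_score(boosts, matched_terms):
    """Sum of matched net boosts over the sum of all net boosts."""
    return sum(boosts[t] for t in matched_terms) / sum(boosts.values())

top = top_score(boosts, {"a", "b"})            # 9/12 = 0.75

def normalize(raw_scores, top):
    """Scale raw scores so the best equals top; all ratios are preserved."""
    factor = top / max(raw_scores)
    return [s * factor for s in raw_scores]
```

Dividing every raw score by the top hit's raw score and multiplying by the boost-based top score is exactly the ratio-preserving normalization described in the message.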
> In addition to enabling more "intelligent" UIs that communicate the quality of results to end-users, the proposal below also extends the explain() mechanism to fully explain the final normalized score. However, that change is also independent -- it could be done with the current scoring.
>
> Am I the only one who would like to see better normalization in Lucene? Does anybody have a better approach?
>
> If you've read this far, thanks for indulging me on this. I would love to see this or something with similar properties in Lucene. I really just want to build my app, but as stated below I would write and contribute this if there is interest in putting it in and nobody else wants to write it. Please let me know what you think one way or the other.
>
> Thanks,
>
> Chuck
>
> > -----Original Message-----
> > From: Chuck Williams
> > Sent: Monday, October 18, 2004 7:04 PM
> > To: 'Lucene Developers List'
> > Subject: RE: idf and explain(), was Re: Search and Scoring
> >
> > Doug Cutting wrote:
> > > If this is a big issue for you, as it seems it is, please submit a patch to optionally disable score normalization in Hits.java.
> > and:
> > > The quantity 'sum(t) weight(t,d)^2' must be recomputed for each document each time a document is added to the collection, since 'weight(t,d)' is dependent on global term statistics. This is prohibitively expensive. Research has also demonstrated that such cosine normalization gives somewhat inferior results (e.g., Singhal's pivoted length normalization).
> >
> > I'm willing to write, test and contribute code to address the normalization issue, i.e. to yield scores in [0, 1] that are meaningful across searches. Unfortunately, this is considerably more involved than just optionally eliminating the current normalization in Hits.
> > Before undertaking this, I'd like to see if there is agreement in principle that this is a good idea, and that my specific proposal below is the right way to go. Also, I'd like to make sure I've correctly inferred the constraints on writing code to be incorporated into Lucene.
> >
> > After looking at this in more detail, I agree that the cosine normalization is not the way to go, because of both efficiency and effectiveness considerations. A conceptual approach that would be efficient, relatively easy to implement, and seems to have at least reasonable behavior would be to define the top scoring match to have a score roughly equal to the percentage of query terms it matches (its "netCoord"). Scores below the top hit would be reduced based on the ratio of their raw scores to the raw score of the top hit (considering all of the current score factors, except that I'd like to remove one of the idf factors, as discussed earlier).
> >
> > For a couple of simple cases:
> >
> >   a) the top match for a single-term query would always have a score of 1.0;
> >   b) the top scoring match for a BooleanQuery using DefaultSimilarity with all non-prohibited TermQuery clauses would have a score of m/n, where the hit matches m of the n terms.
> >
> > This isn't optimal, but seems much better than the current situation. Consider two single-term queries, s and t. If s matches more strongly than t in its top hit (e.g., a higher tf in a shorter field), it would be best if the top score of s were greater than the top score of t. But this kind of normalization would require keeping additional statistics that, so far as I know, are not currently in the index, like the maximum tf for terms and the minimum length for fields. These could be expensive to update on deletes.
> > Also, normalizing by such factors could yield lower than subjectively reasonable scores in most cases, so it's not clear it would be better.
> >
> > The semantics above are at least clean, easy to understand, and support what seems to me the most important motivation for doing this: allowing an application to use simple thresholding to segregate likely-to-be-relevant hits from likely-to-be-irrelevant hits.
> >
> > More specifically, for a BooleanQuery of TermQuery's, the resulting score functions would be:
> >
> >   BooleanQuery of TermQuery's: sbq = (tq1 ... tqn)
> >
> >   baseScore(sbq, doc) = sum(tqi) boost(tqi) * idf(tqi.term) * tf(tqi.term, doc) * lengthNorm(tqi.term.field, doc)
> >
> >   rawScore(sbq, doc) = coord(sbq, doc) * baseScore(sbq, doc)
> >
> >   norm(sbq, hits) = 1 / max(hit in hits) baseScore(sbq, hit)
> >
> >   score(sbq, doc) = rawScore(sbq, doc) * norm(sbq, hits)
> >
> > rawScores can be computed in the Scorer.score() methods and therefore used to sort the hits (e.g., in the instance method for collect() in the HitCollector in IndexSearcher.search()). If the top scoring hit does not have the highest baseScore, then its score could be less than its coord; this seems desirable. These formulas imply that no result score can be larger than its coord, so if coord is well-defined (always between 0 and 1) then all results will be normalized between 0 and 1.
> >
> > In general, the netCoord, which takes the place of coord in the simple case above, needs to be defined for any query. Conceptually, this should be the total percentage of query terms matched by the document. It must be recursively computable from the subquery structure and overridable for specific Query types (e.g., to support specialized coords, like one that is always 1.0, as is useful in multi-field searching).
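A toy sketch of these score functions together with a recursive netCoord may make the proposal more concrete. This is plain Python rather than Lucene's Java, and everything in it is hypothetical: per-term weights stand in for boost * idf * tf * lengthNorm, and a plain list stands in for a BooleanQuery that weights each of its n clauses equally by 1/n:

```python
# Hypothetical sketch of baseScore / rawScore / norm / score; not Lucene
# code. Each document is modeled as the set of query terms it matches.

def base_score(weights, doc_terms):
    """Sum of the weights of the query terms the document matches."""
    return sum(w for t, w in weights.items() if t in doc_terms)

def net_coord(node, doc_terms):
    """Leaves (term strings) score 1.0 if matched, else 0.0; a list
    (BooleanQuery) averages its clauses' netCoords."""
    if isinstance(node, str):
        return 1.0 if node in doc_terms else 0.0
    return sum(net_coord(c, doc_terms) for c in node) / len(node)

def scores(query, weights, docs):
    """rawScore = netCoord * baseScore; normalize by the max baseScore."""
    raws = [net_coord(query, d) * base_score(weights, d) for d in docs]
    norm = 1.0 / max(base_score(weights, d) for d in docs)
    return [r * norm for r in raws]

query = ["a", "b"]               # BooleanQuery of two TermQuery's
weights = {"a": 2.0, "b": 1.0}
docs = [{"a", "b"}, {"a"}]
scores(query, weights, docs)     # the hit matching both terms scores 1.0
```

A MaxDisjunctionQuery-style override would simply replace the averaging line with max() over the children.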
> > Suitable default definitions for TermQuery and BooleanQuery are:
> >
> >   TermQuery.netCoord = 1.0 if the term matches, 0.0 otherwise
> >
> >   BooleanQuery(c1 ... cn).netCoord = sum(ci) coord(1, n) * ci.netCoord
> >
> > This is not quite the percentage of terms matched; e.g., consider a BooleanQuery with two clauses, one of which is a BooleanQuery of 3 terms and the other of which is just a term. However, it doesn't seem unreasonable for a BooleanQuery to state that its clauses are equally important, and this is consistent with the current coord behavior. BooleanQuery.netCoord could be overridden for special cases, like the pure disjunction I use in my app for field expansions:
> >
> >   MaxDisjunctionQuery(c1 ... cn).netCoord = max(ci) ci.netCoord
> >
> > Implementing this would proceed along these lines:
> >
> >   1. For backwards compatibility, add some kind of newScoring boolean setting.
> >   2. Update all of these places to behave as indicated if newScoring:
> >      a. Change Query.weight() to not do any normalization (no call to sumOfSquaredWeights(), Similarity.queryNorm() or normalize()).
> >      b. Update all Query.weight classes to set their value according to the terms in the score formula above that don't involve the document (e.g., boost*idf for TermQuery).
> >      c. Add suitable netCoord definitions to all Scorer classes.
> >      d. Update all Scorer.score() methods to compute the rawScore as above.
> >      e. Add the maximum baseScore as a field kept on TopDocs, computed in the HitCollector's.
> >      f. Change the normalization in Hits to always divide every raw score by the maximum baseScore.
> >      g. Update all of the current explain() methods to be consistent with this scoring, and to either report the rawScore, or to report the final score if the normalization factor is provided.
> >      h. Add Hits.explain() (or, better, extend Searcher so that it keeps the Hits for use in Searcher.explain()) to call the new explain variation with the normalization factor so that final scores are fully explained.
> >
> > If this seems like a good idea, please let me know. I'm sure there are details I've missed that would come out during coding and testing. Also, the value of this is dependent on how reasonable the final scores look, which is hard to tell for sure until it is working.
> >
> > The coding standards for Lucene seem reasonably clear from the source code I've read. I could use just simple Java, so there shouldn't be any significant JVM dependencies. The above should be fully backward compatible due to the newScoring flag.
> >
> > On another note, I had to remove the German analyzer in my current 1.4.2 source configuration because GermanStemmer failed to compile due to what are apparently Unicode character constants that I've now got as illegal two-character character constants. This is presumably an encoding problem somewhere that I could track down. It's not important, but if the answer is obvious to any of you, I'd appreciate the quick tip.
> >
> > Thanks,
> >
> > Chuck
> >
> > > -----Original Message-----
> > > From: Doug Cutting [mailto:[EMAIL PROTECTED]]
> > > Sent: Monday, October 18, 2004 9:44 AM
> > > To: Lucene Developers List
> > > Subject: Re: idf and explain(), was Re: Search and Scoring
> > >
> > > Chuck Williams wrote:
> > > > That's a good point on how the standard vector space inner product similarity measure does imply that the idf is squared relative to the document tf. Even having been aware of this formula for a long time, this particular implication never occurred to me. Do you know if anybody has done precision/recall or other relevancy empirical measurements comparing this vs.
> > > > a model that does not square idf?
> > >
> > > No, not that I know of.
> > >
> > > > Regarding normalization, the normalization in Hits does not have very nice properties. Due to the > 1.0 threshold check, it loses information, and it arbitrarily defines the highest scoring result in any list that generates scores above 1.0 as a perfect match. It would be nice if score values were meaningful independent of searches, e.g., if "0.6" meant the same quality of retrieval independent of what search was done. This would allow, for example, sites to use a simple quality threshold to only show results that were "good enough". At my last company (I was President and head of engineering at InQuira), we found this to be important to many customers.
> > >
> > > If this is a big issue for you, as it seems it is, please submit a patch to optionally disable score normalization in Hits.java.
> > >
> > > > The standard vector space similarity measure includes normalization by the product of the norms of the vectors, i.e.:
> > > >
> > > >   score(d,q) = sum over t of ( weight(t,q) * weight(t,d) )
> > > >                / sqrt[ (sum(t) weight(t,q)^2) * (sum(t) weight(t,d)^2) ]
> > > >
> > > > This makes the score a cosine, which, since the values are all positive, forces it to be in [0, 1]. The sumOfSquares() normalization in Lucene does not fully implement this. Is there a specific reason for that?
> > >
> > > The quantity 'sum(t) weight(t,d)^2' must be recomputed for each document each time a document is added to the collection, since 'weight(t,d)' is dependent on global term statistics. This is prohibitively expensive. Research has also demonstrated that such cosine normalization gives somewhat inferior results (e.g., Singhal's pivoted length normalization).
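The full cosine measure quoted above is straightforward to state in code. A minimal sketch (plain Python, not Lucene code), assuming sparse weight vectors represented as dicts:

```python
# Hypothetical sketch of the standard vector-space cosine similarity:
# the dot product of the query and document weight vectors divided by
# the product of their Euclidean norms. With non-negative weights the
# result always lies in [0, 1].

import math

def cosine_score(q_weights, d_weights):
    terms = set(q_weights) | set(d_weights)
    dot = sum(q_weights.get(t, 0.0) * d_weights.get(t, 0.0) for t in terms)
    q_norm = math.sqrt(sum(w * w for w in q_weights.values()))
    d_norm = math.sqrt(sum(w * w for w in d_weights.values()))
    return dot / (q_norm * d_norm)

cosine_score({"a": 1.0, "b": 1.0}, {"a": 1.0, "b": 1.0})  # identical vectors
```

As Doug notes, the document-side norm depends on global term statistics, which is why maintaining it incrementally is prohibitively expensive.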
> > > > Re. explain(), I don't see a downside to extending it to show the final normalization in Hits. It could still show the raw score just prior to that normalization.
> > >
> > > In order to normalize scores to 1.0, one must know the maximum score. Explain only computes the score for a single document, and the maximum score is not known.
> > >
> > > > Although I think it would be best to have a normalization that would render scores comparable across searches.
> > >
> > > Then please submit a patch. Lucene doesn't change on its own.
> > >
> > > Doug
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > For additional commands, e-mail: [EMAIL PROTECTED]