Chuck - if you provide patches and post them to Bugzilla, one of the developers will try them locally and either provide feedback or merge your changes into CVS.
Otis

--- Chuck Williams <[EMAIL PROTECTED]> wrote:

> Thanks Otis. Other than trying to get some consensus a) that this is a problem worth fixing, and b) on the best approach to fix it, my central question is: if I fix it, is it likely to get incorporated back into Lucene? I don't want to deviate from the Lucene sources, especially across so many classes, so I would like to address this only if there is a process to evaluate the changes and incorporate them back into Lucene if they provide the improvement I believe they will.
>
> Thanks for any guidance on that,
>
> Chuck
>
> > -----Original Message-----
> > From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
> > Sent: Thursday, October 21, 2004 10:06 AM
> > To: Lucene Developers List
> > Subject: Re: Normalized Scoring -- was RE: idf and explain(), was Re: Search and Scoring
> >
> > Hi Chuck,
> >
> > The relative lack of responses is not because there is no interest, but probably because there are only a few people on lucene-dev who can follow/understand every detail of your proposal. I understand and hear you, but I have a hard time 'visualizing' some of the formulas in your proposal. What you are saying sounds right to me, but I don't have enough theoretical knowledge to go one way or the other.
> >
> > Otis
> >
> > --- Chuck Williams <[EMAIL PROTECTED]> wrote:
> > >
> > > Hi everybody,
> > >
> > > Although there doesn't seem to be much interest in this, I have one significant improvement to the proposal below and a couple of observations that clarify the situation.
> > >
> > > To illustrate the problem that better normalization is intended to address: in my current application, for BooleanQuery's of two terms I frequently get a top score of 1.0 when only one of the terms is matched. 1.0 should indicate a "perfect match". I'd like to set my UI up to present the results differently depending on how good they are (e.g., showing a visual indication of result quality, dropping the really bad results entirely, and segregating the good results from other, only vaguely relevant results). To build this kind of "intelligence" into the UI, I certainly need to know whether my top result matched all query terms or only half of them. I'd like the score to tell me directly how good the matches are. The current normalization does not achieve this.
> > >
> > > The intent of a new normalization scheme is to preserve the current scoring in the following sense: the ratio of the nth result's score to the best result's score remains the same. Therefore, the only question is what normalization factor to apply to all scores. This reduces to a very specific question that determines the entire normalization -- what should the score of the best matching result be?
> > >
> > > The mechanism below has this property, i.e. it keeps the current score ratios, except that I removed one idf term for reasons covered earlier (this better reflects the current empirically best scoring algorithms). However, removing an idf is a completely separate issue. The improved normalization is independent of whether or not that change is made.
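> > >
> > > (To make the ratio preservation concrete, here is a minimal sketch of such a renormalization -- hypothetical helper code, not the actual Hits implementation:)
> > >
> > >     // Sketch: rescale raw scores so the best hit gets 'topScore'
> > >     // (e.g., the fraction of query terms it matched) while the
> > >     // ratio of every hit's score to the best hit's is preserved.
> > >     static float[] renormalize(float[] rawScores, float topScore) {
> > >         float max = 0.0f;
> > >         for (float s : rawScores) max = Math.max(max, s);
> > >         float norm = (max > 0.0f) ? topScore / max : 0.0f;
> > >         float[] scores = new float[rawScores.length];
> > >         for (int i = 0; i < rawScores.length; i++)
> > >             scores[i] = rawScores[i] * norm;
> > >         return scores;
> > >     }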
> > > For the central question of what the top score should be, upon reflection, I don't like the definition below. It defined the top score as (approximately) the percentage of query terms matched by the top scoring result. A better conceptual definition is to use a weighted average based on the boosts, i.e., propagate all boosts downward to the underlying terms (or phrases). Specifically, the "net boost" of a term is the product of the direct boost of the term and all boosts applied to encompassing clauses. Then the score of the top result becomes the sum of the net boosts of its matching terms divided by the sum of the net boosts of all query terms.
> > >
> > > This definition is a refinement of the original proposal below, and it could probably be further refined if some impact of the tf, idf and/or lengthNorm were desired in determining the top score. These other factors seem to be harder to normalize for, although I've thought of some simple approaches; e.g., assume the unmatched terms in the top result have values for these three factors that are the average of the values of the matched terms, then apply exactly the same concept of actual score divided by theoretical maximum score. That would eliminate any need to maintain maximum-value statistics in the index.
> > >
> > > As an example of the simple boost-based normalization, for the query ((a^2 b)^3 (c d^2)) the net boosts are:
> > >
> > >     a --> 6
> > >     b --> 3
> > >     c --> 1
> > >     d --> 2
> > >
> > > So if a and b matched in the top scoring result, but not c and d, its score would be (6+3)/(6+3+1+2) = 0.75. The normalizer would be 0.75 / (current score except for the current normalization). This normalizer would be applied to all current scores (minus normalization) to create the normalized scores.
> > >
> > > For the simple query (a b), if only one of the terms matched in the top result, then its score would be 0.5, vs. 1.0 or many other possible scores today.
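> > >
> > > (A minimal sketch of the downward boost propagation -- the little Node/Term/Clause classes are hypothetical stand-ins, not the Lucene Query API:)
> > >
> > >     import java.util.Map;
> > >
> > >     // A query node is either a term or a clause list; every node has a boost.
> > >     abstract class Node { float boost = 1.0f; }
> > >     class Term extends Node { String name; Term(String n, float b) { name = n; boost = b; } }
> > >     class Clause extends Node { Node[] children; Clause(float b, Node... c) { boost = b; children = c; } }
> > >
> > >     class NetBoosts {
> > >         // Net boost of a term = its own boost times the boosts of all enclosing clauses.
> > >         static void collect(Node n, float enclosing, Map<String, Float> out) {
> > >             float b = enclosing * n.boost;
> > >             if (n instanceof Term) out.put(((Term) n).name, b);
> > >             else for (Node c : ((Clause) n).children) collect(c, b, out);
> > >         }
> > >     }
> > >
> > > Running this on ((a^2 b)^3 (c d^2)) -- i.e., new Clause(1, new Clause(3, new Term("a", 2), new Term("b", 1)), new Clause(1, new Term("c", 1), new Term("d", 2))) -- yields exactly the net boosts above: a=6, b=3, c=1, d=2.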
> > > In addition to enabling more "intelligent" UIs that communicate the quality of results to end users, the proposal below also extends the explain() mechanism to fully explain the final normalized score. However, that change is also independent -- it could be done with the current scoring.
> > >
> > > Am I the only one who would like to see better normalization in Lucene? Does anybody have a better approach?
> > >
> > > If you've read this far, thanks for indulging me on this. I would love to see this or something with similar properties in Lucene. I really just want to build my app, but as stated below I would write and contribute this if there is interest in putting it in and nobody else wants to write it. Please let me know what you think one way or the other.
> > >
> > > Thanks,
> > >
> > > Chuck
> > >
> > > > -----Original Message-----
> > > > From: Chuck Williams
> > > > Sent: Monday, October 18, 2004 7:04 PM
> > > > To: 'Lucene Developers List'
> > > > Subject: RE: idf and explain(), was Re: Search and Scoring
> > > >
> > > > Doug Cutting wrote:
> > > > > If this is a big issue for you, as it seems it is, please submit a patch to optionally disable score normalization in Hits.java.
> > > >
> > > > and:
> > > > > The quantity 'sum(t) weight(t,d)^2' must be recomputed for each document each time a document is added to the collection, since 'weight(t,d)' is dependent on global term statistics. This is prohibitively expensive. Research has also demonstrated that such cosine normalization gives somewhat inferior results (e.g., Singhal's pivoted length normalization).
> > > >
> > > > I'm willing to write, test and contribute code to address the normalization issue, i.e. to yield scores in [0, 1] that are meaningful across searches. Unfortunately, this is considerably more involved than just optionally eliminating the current normalization in Hits. Before undertaking this, I'd like to see if there is agreement in principle that this is a good idea, and that my specific proposal below is the right way to go. Also, I'd like to make sure I've correctly inferred the constraints on writing code to be incorporated into Lucene.
> > > >
> > > > After looking at this in more detail, I agree that the cosine normalization is not the way to go, for both efficiency and effectiveness reasons. A conceptual approach that would be efficient, relatively easy to implement, and seems to have at least reasonable behavior would be to define the top scoring match to have a score roughly equal to the percentage of query terms it matches (its "netCoord"). Scores below the top hit would be reduced based on the ratio of their raw scores to the raw score of the top hit (considering all of the current score factors, except that I'd like to remove one of the idf factors, as discussed earlier).
> > > >
> > > > For a couple of simple cases:
> > > > a) the top match for a single-term query would always have a score of 1.0;
> > > > b) the top scoring match for a BooleanQuery using DefaultSimilarity with all non-prohibited TermQuery clauses would have a score of m/n, where the hit matches m of the n terms.
> > > >
> > > > This isn't optimal, but it seems much better than the current situation. Consider two single-term queries, s and t. If s matches more strongly than t in its top hit (e.g., a higher tf in a shorter field), it would be best if the top score of s were greater than the top score of t. But this kind of normalization would require keeping additional statistics that, so far as I know, are not currently in the index, like the maximum tf for terms and the minimum length for fields. These could be expensive to update on deletes. Also, normalizing by such factors could yield lower than subjectively reasonable scores in most cases, so it's not clear it would be better.
> > > >
> > > > The semantics above are at least clean and easy to understand, and they support what seems to me the most important motivation for doing this: allowing an application to use simple thresholding to segregate likely-to-be-relevant hits from likely-to-be-irrelevant hits, e.g. as in the sketch below.
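> > > >
> > > > (A minimal sketch of that thresholding, assuming the normalized scores proposed here; Hits.length() and Hits.score(i) are the existing API, and the 0.5 cutoff is just an illustrative choice:)
> > > >
> > > >     // With scores normalized so 1.0 means "matched all query terms",
> > > >     // a fixed cutoff cleanly separates strong matches from weak ones.
> > > >     static int countLikelyRelevant(org.apache.lucene.search.Hits hits, float cutoff)
> > > >             throws java.io.IOException {
> > > >         int n = 0;
> > > >         for (int i = 0; i < hits.length(); i++)
> > > >             if (hits.score(i) >= cutoff) n++;  // e.g., cutoff = 0.5 keeps hits matching at least half the terms
> > > >         return n;
> > > >     }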
> > > > More specifically, for a BooleanQuery of TermQuery's the resulting score functions would be:
> > > >
> > > >     BooleanQuery of TermQuery's: sbq = (tq1 ... tqn)
> > > >
> > > >     baseScore(sbq, doc) = sum(tqi) boost(tqi) * idf(tqi.term) * tf(tqi.term, doc) * lengthNorm(tqi.term.field, doc)
> > > >
> > > >     rawScore(sbq, doc) = coord(sbq, doc) * baseScore(sbq, doc)
> > > >
> > > >     norm(sbq, hits) = 1 / max(hit in hits) baseScore(sbq, hit)
> > > >
> > > >     score(sbq, doc) = rawScore(sbq, doc) * norm(sbq, hits)
> > > >
> > > > rawScores can be computed in the Scorer.score() methods and therefore used to sort the hits (e.g., in the instance method for collect() in the HitCollector in IndexSearcher.search()). If the top scoring hit does not have the highest baseScore, then its score could be less than its coord; this seems desirable. These formulas imply that no result's score can be larger than its coord, so if coord is well-defined (always between 0 and 1) then all results will be normalized between 0 and 1.
> > > >
> > > > In general, the netCoord, which takes the place of coord in the simple case above, needs to be defined for any query. Conceptually, this should be the total percentage of query terms matched by the document. It must be recursively computable from the subquery structure and overridable for specific Query types (e.g., to support specialized coords, like one that is always 1.0, as is useful in multi-field searching). Suitable default definitions for TermQuery and BooleanQuery are:
> > > >
> > > >     TermQuery.netCoord = 1.0 if the term matches, 0.0 otherwise
> > > >
> > > >     BooleanQuery(c1 ... cn).netCoord = sum(ci) coord(1, n) * ci.netCoord
> > > >
> > > > This is not quite the percentage of terms matched; e.g., consider a BooleanQuery with two clauses, one of which is a BooleanQuery of 3 terms and the other just a term. However, it doesn't seem unreasonable for a BooleanQuery to state that its clauses are equally important, and this is consistent with the current coord behavior. BooleanQuery.netCoord could be overridden for special cases, like the pure disjunction I use in my app for field expansions:
> > > >
> > > >     MaxDisjunctionQuery(c1 ... cn).netCoord = max(ci) ci.netCoord
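> > > >
> > > > (A sketch of the netCoord recursion, reusing the hypothetical Node/Term/Clause classes from the earlier sketch -- not the real Scorer API:)
> > > >
> > > >     import java.util.Set;
> > > >
> > > >     class NetCoord {
> > > >         // Fraction of the query matched, computed bottom-up. A term contributes
> > > >         // 1.0 if it matched; a BooleanQuery-style clause averages its children,
> > > >         // since coord(1, n) = 1/n; a pure disjunction would take the max instead.
> > > >         static float netCoord(Node n, Set<String> matched) {
> > > >             if (n instanceof Term) return matched.contains(((Term) n).name) ? 1.0f : 0.0f;
> > > >             Node[] kids = ((Clause) n).children;
> > > >             float sum = 0.0f;
> > > >             for (Node c : kids) sum += netCoord(c, matched);
> > > >             return sum / kids.length;
> > > >         }
> > > >     }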
> > > > Implementing this would proceed along these lines:
> > > >
> > > > 1. For backwards compatibility, add some kind of newScoring boolean setting.
> > > > 2. Update all of these places to behave as indicated if newScoring:
> > > >    a. Change Query.weight() to not do any normalization (no call to sumOfSquaredWeights(), Similarity.queryNorm() or normalize()).
> > > >    b. Update all Query.weight classes to set their value according to the terms in the score formula above that don't involve the document (e.g., boost*idf for TermQuery).
> > > >    c. Add suitable netCoord definitions to all Scorer classes.
> > > >    d. Update all Scorer.score() methods to compute the rawScore as above.
> > > >    e. Add the maximum baseScore as a field kept on TopDocs, computed in the HitCollectors.
> > > >    f. Change the normalization in Hits to always divide every raw score by the maximum baseScore.
> > > >    g. Update all of the current explain() methods to be consistent with this scoring, and to either report the rawScore, or to report the final score if the normalization factor is provided.
> > > >    h. Add Hits.explain() (or, better, extend Searcher so that it keeps the Hits for use in Searcher.explain()) to call the new explain variation with the normalization factor so that final scores are fully explained.
> > > >
> > > > If this seems like a good idea, please let me know. I'm sure there are details I've missed that would come out during coding and testing. Also, the value of this depends on how reasonable the final scores look, which is hard to tell for sure until it is working.
> > > >
> > > > The coding standards for Lucene seem reasonably clear from the source code I've read. I would use just simple Java, so there shouldn't be any significant JVM dependencies. The above should be fully backward compatible due to the newScoring flag.
> > > >
> > > > On another note, I had to remove the German analyzer in my current 1.4.2 source configuration because GermanStemmer failed to compile due to what are apparently Unicode character constants that I've now got as illegal two-character character constants. This is presumably an encoding problem somewhere that I could track down. It's not important, but if the answer is obvious to any of you, I'd appreciate the quick tip.
> > > >
> > > > Thanks,
> > > >
> > > > Chuck
> > > >
> > > > > -----Original Message-----
> > > > > From: Doug Cutting [mailto:[EMAIL PROTECTED]]
> > > > > Sent: Monday, October 18, 2004 9:44 AM
> > > > > To: Lucene Developers List
> > > > > Subject: Re: idf and explain(), was Re: Search and Scoring
> > > > >
> > > > > Chuck Williams wrote:
> > > > > > That's a good point on how the standard vector space inner product similarity measure does imply that the idf is squared relative to the document tf. Even having been aware of this formula for a long time, this particular implication never occurred to me. Do you know if anybody has done precision/recall or other empirical relevancy measurements comparing this vs. a model that does not square idf?
> > > > >
> > > > > No, not that I know of.
> > > > >
> > > > > > Regarding normalization, the normalization in Hits does not have very nice properties. Due to the > 1.0 threshold check, it loses information, and it arbitrarily defines the highest scoring result in any list that generates scores above 1.0 as a perfect match. It would be nice if score values were meaningful independent of searches, e.g., if "0.6" meant the same quality of retrieval independent of what search was done. This would allow, for example, sites to use a simple quality threshold to only show results that were "good enough". At my last company (I was President and head of engineering for InQuira), we found this to be important to many customers.
> > > > >
> > > > > If this is a big issue for you, as it seems it is, please submit a patch to optionally disable score normalization in Hits.java.
> > > > >
> > > > > > The standard vector space similarity measure includes normalization by the product of the norms of the vectors, i.e.:
> > > > > >
> > > > > >     score(d,q) = sum over t of ( weight(t,q) * weight(t,d) ) / sqrt[ (sum(t) weight(t,q)^2) * (sum(t) weight(t,d)^2) ]
> > > > > >
> > > > > > This makes the score a cosine, which, since the values are all positive, forces it to be in [0, 1]. The sumOfSquares() normalization in Lucene does not fully implement this. Is there a specific reason for that?
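> > > > > >
> > > > > > (For reference, a minimal sketch of this cosine measure over sparse term-weight vectors -- illustration only, not Lucene code:)
> > > > > >
> > > > > >     import java.util.Map;
> > > > > >
> > > > > >     class Cosine {
> > > > > >         // The document norm (sum of squared document weights) is the part
> > > > > >         // that would have to be recomputed as global term statistics change.
> > > > > >         static double score(Map<String, Double> q, Map<String, Double> d) {
> > > > > >             double dot = 0, qNorm = 0, dNorm = 0;
> > > > > >             for (Map.Entry<String, Double> e : q.entrySet()) {
> > > > > >                 qNorm += e.getValue() * e.getValue();
> > > > > >                 Double w = d.get(e.getKey());
> > > > > >                 if (w != null) dot += e.getValue() * w;
> > > > > >             }
> > > > > >             for (double w : d.values()) dNorm += w * w;
> > > > > >             return dot == 0 ? 0 : dot / Math.sqrt(qNorm * dNorm);
> > > > > >         }
> > > > > >     }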
> > > > At my > > > > > > last company (I was President and head of engineering > for > > > InQuira), > > > > we > > > > > > found this to be important to many customers. > > > > > > > > > > If this is a big issue for you, as it seems it is, > please > > > submit > > > a > > > > patch > > > > > to optionally disable score normalization in Hits.java. > > > > > > > > > > > The standard vector space similarity measure includes > > > > normalization by > > > > > > the product of the norms of the vectors, i.e.: > > > > > > > > > > > > score(d,q) = sum over t of ( weight(t,q) * > weight(t,d) ) > > > / > > > > > > sqrt [ (sum(t) weight(t,q)^2) * > (sum(t) > > > > > weight(t,d)^2) ] > > > > > > > > > > > > This makes the score a cosine, which since the values > are > > > all > > > > positive, > > > > > > forces it to be in [0, 1]. The sumOfSquares() > > > normalization > > > in > > > > Lucene > > > > > > does not fully implement this. Is there a specific > reason > > > for > > > > that? > > > > > > > > > > The quantity 'sum(t) weight(t,d)^2' must be recomputed > for > > > each > > > > document > > > > > each time a document is added to the collection, since > > > 'weight(t,d)' > > > > is > > > > > dependent on global term statistics. This is > prohibitivly > > > expensive. > > > > > Research has also demonstrated that such cosine > normalization > > > gives > > > > > somewhat inferior results (e.g., Singhal's pivoted > length > > > > normalization). > > > > > > > > > > > Re. explain(), I don't see a downside to extending it > show > > > the > > > > final > > > > > > normalization in Hits. It could still show the raw > score > > > just > > > > prior > > > > > to > > > > > > that normalization. > > > > > > > > > > In order to normalize scores to 1.0 one must know the > maximum > > > score. > > > > > Explain only computes the score for a single document, > and > > > the > > > > maximum > > > > > score is not known. > > > > > > > > > > > Although I think it would be best to have a > > > > > > normalization that would render scores comparable > across > > > searches. > > > > > > > > > > Then please submit a patch. Lucene doesn't change on > its > > > own. > > > > > > > > > > Doug > > > > > > > > > > > > > > > > > > > > > > > > -------------------------------------------------------------------- > > > > - > > > > > To unsubscribe, e-mail: > > > [EMAIL PROTECTED] > > > > > For additional commands, e-mail: > > > [EMAIL PROTECTED] > > > > > > > > > > --------------------------------------------------------------------- > > > To unsubscribe, e-mail: > [EMAIL PROTECTED] > > > For additional commands, e-mail: > [EMAIL PROTECTED] > > > > > > > > > > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > For additional commands, e-mail: > [EMAIL PROTECTED] > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]