Thanks Otis. Other than trying to get some consensus a) that this is a problem worth fixing, and b) on the best approach to fixing it, my central question is: if I fix it, is the fix likely to be incorporated back into Lucene? I don't want to deviate from the Lucene sources, especially across so many classes, so I would like to take this on only if there is a process to evaluate the changes and fold them back into Lucene if they provide the improvement I believe they will.

Thanks for any guidance on that,

Chuck

> -----Original Message-----
> From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
> Sent: Thursday, October 21, 2004 10:06 AM
> To: Lucene Developers List
> Subject: Re: Normalized Scoring -- was RE: idf and explain(), was Re: Search and Scoring

> Hi Chuck,

> The relative lack of responses is not because there is no interest, but probably because there are only a few people on lucene-dev who can follow/understand every detail of your proposal. I understand and hear you, but I have a hard time 'visualizing' some of the formulas in your proposal. What you are saying sounds right to me, but I don't have enough theoretical knowledge to go one way or the other.

> Otis

> --- Chuck Williams <[EMAIL PROTECTED]> wrote:

> > Hi everybody,

> > Although there doesn't seem to be much interest in this, I have one significant improvement to the proposal below and a couple of observations that clarify the situation.

> > To illustrate the problem that better normalization is intended to address: in my current application, for BooleanQuerys of two terms I frequently get a top score of 1.0 when only one of the terms is matched. 1.0 should indicate a "perfect match". I'd like to set my UI up to present the results differently depending on how good they are (e.g., showing a visual indication of result quality, dropping the really bad results entirely, and segregating the good results from other only vaguely relevant results). To build this kind of "intelligence" into the UI, I certainly need to know whether my top result matched all query terms, or only half of them. I'd like the score to tell me directly how good the matches are. The current normalization does not achieve this.

> > The intent of a new normalization scheme is to preserve the current scoring in the following sense: the ratio of the nth result's score to the best result's score remains the same. Therefore, the only question is what normalization factor to apply to all scores. This reduces to a very specific question that determines the entire normalization -- what should the score of the best matching result be?

> > The mechanism below has this property, i.e. it keeps the current score ratios, except that I removed one idf term for reasons covered earlier (this better reflects the current empirically best scoring algorithms). However, removing an idf is a completely separate issue. The improved normalization is independent of whether or not that change is made.

> > For the central question of what the top score should be, upon reflection, I don't like the definition below. It defined the top score as (approximately) the percentage of query terms matched by the top scoring result. A better conceptual definition is to use a weighted average based on the boosts, i.e., downward propagate all boosts to the underlying terms (or phrases). Specifically, the "net boost" of a term is the product of the direct boost of the term and all boosts applied to encompassing clauses. Then the score of the top result becomes the sum of the net boosts of its matching terms divided by the sum of the net boosts of all query terms.
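
> > Roughly, in code, the net-boost computation is just a walk of the query tree. This is a sketch only, not working Lucene source -- matches() and subQueries() below are hand-waved stand-ins for whatever the real implementation would use:

> >   // Sketch: push boosts down to the leaf terms.  The top score is then
> >   //   sumNetBoosts(query, 1.0f, true, topDoc) / sumNetBoosts(query, 1.0f, false, null)
> >   float sumNetBoosts(Query q, float inheritedBoost, boolean matchedOnly, Document doc) {
> >     float netBoost = inheritedBoost * q.getBoost();
> >     if (q instanceof TermQuery)
> >       // matches() stands for "this term occurred in the document being scored"
> >       return (!matchedOnly || matches((TermQuery) q, doc)) ? netBoost : 0.0f;
> >     float sum = 0.0f;
> >     for (Iterator it = subQueries(q).iterator(); it.hasNext(); )
> >       sum += sumNetBoosts((Query) it.next(), netBoost, matchedOnly, doc);
> >     return sum;
> >   }

> > (For a pure disjunction the sum would presumably become a max, but the boost propagation is the same.)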

> > This definition is a refinement of the original proposal below, and it could probably be further refined if some impact of the tf, idf and/or lengthNorm were desired in determining the top score. These other factors seem to be harder to normalize for, although I've thought of some simple approaches; e.g., assume the unmatched terms in the top result have values for these three factors equal to the average of the values of the matched terms, then apply exactly the same concept of actual score divided by theoretical maximum score. That would eliminate any need to maintain maximum-value statistics in the index.

> > As an example of the simple boost-based normalization, for the query

> >   ((a^2 b)^3 (c d^2))

> > the net boosts are:

> >   a --> 6
> >   b --> 3
> >   c --> 1
> >   d --> 2

> > So if a and b matched, but not c and d, in the top scoring result, its score would be (6 + 3) / (6 + 3 + 1 + 2) = 0.75. The normalizer would be 0.75 / (the top result's current score, computed without the current normalization); this normalizer would then be applied to all current scores (again without the current normalization) to produce the normalized scores.

> > For the simple query (a b), if only one of the terms matched in the top result, then its score would be 0.5, vs. 1.0 or many other possible scores today.

> > In addition to enabling more "intelligent" UIs that communicate the quality of results to end users, the proposal below also extends the explain() mechanism to fully explain the final normalized score. However, that change is also independent -- it could be done with the current scoring.

> > Am I the only one who would like to see better normalization in Lucene? Does anybody have a better approach?

> > If you've read this far, thanks for indulging me on this. I would love to see this or something with similar properties in Lucene. I really just want to build my app, but as stated below I would write and contribute this if there is interest in putting it in and nobody else wants to write it. Please let me know what you think one way or the other.

> > Thanks,

> > Chuck

> > > -----Original Message-----
> > > From: Chuck Williams
> > > Sent: Monday, October 18, 2004 7:04 PM
> > > To: 'Lucene Developers List'
> > > Subject: RE: idf and explain(), was Re: Search and Scoring

> > > Doug Cutting wrote:
> > > > If this is a big issue for you, as it seems it is, please submit a patch to optionally disable score normalization in Hits.java.
> > > and:
> > > > The quantity 'sum(t) weight(t,d)^2' must be recomputed for each document each time a document is added to the collection, since 'weight(t,d)' is dependent on global term statistics. This is prohibitively expensive. Research has also demonstrated that such cosine normalization gives somewhat inferior results (e.g., Singhal's pivoted length normalization).

> > > I'm willing to write, test and contribute code to address the normalization issue, i.e. to yield scores in [0, 1] that are meaningful across searches. Unfortunately, this is considerably more involved than just optionally eliminating the current normalization in Hits. Before undertaking this, I'd like to see if there is agreement in principle that this is a good idea, and that my specific proposal below is the right way to go.
> > > Also, I'd like to make sure I've correctly inferred the constraints on writing code to be incorporated into Lucene.

> > > After looking at this in more detail, I agree that the cosine normalization is not the way to go, for both efficiency and effectiveness reasons. A conceptual approach that would be efficient, relatively easy to implement, and that seems to have at least reasonable behavior would be to define the top scoring match to have a score roughly equal to the percentage of query terms it matches (its "netCoord"). Scores below the top hit would be reduced based on the ratio of their raw scores to the raw score of the top hit (considering all of the current score factors, except that I'd like to remove one of the idf factors, as discussed earlier).

> > > For a couple of simple cases:
> > >   a) the top match for a single-term query would always have a score of 1.0;
> > >   b) the top scoring match for a BooleanQuery using DefaultSimilarity with all non-prohibited TermQuery clauses would have a score of m/n, where the hit matches m of the n terms.

> > > This isn't optimal, but it seems much better than the current situation. Consider two single-term queries, s and t. If s matches more strongly than t in its top hit (e.g., a higher tf in a shorter field), it would be best if the top score of s were greater than the top score of t. But that kind of normalization would require keeping additional statistics that, so far as I know, are not currently in the index, like the maximum tf for terms and the minimum length for fields. These could be expensive to update on deletes. Also, normalizing by such factors could yield lower than subjectively reasonable scores in most cases, so it's not clear it would be better.

> > > The semantics above are at least clean, easy to understand, and support what seems to me to be the most important motivation for doing this: allowing an application to use simple thresholding to segregate likely-to-be-relevant hits from likely-to-be-irrelevant hits.

> > > More specifically, for a BooleanQuery of TermQuerys the resulting score functions would be:

> > >   BooleanQuery of TermQuerys:  sbq = (tq1 ... tqn)

> > >   baseScore(sbq, doc) = sum(tqi) boost(tqi) * idf(tqi.term) * tf(tqi.term, doc) * lengthNorm(tqi.term.field, doc)

> > >   rawScore(sbq, doc)  = coord(sbq, doc) * baseScore(sbq, doc)

> > >   norm(sbq, hits)     = 1 / max(hit in hits) baseScore(sbq, hit)

> > >   score(sbq, doc)     = rawScore(sbq, doc) * norm(sbq, hits)

> > > rawScores can be computed in the Scorer.score() methods and therefore used to sort the hits (e.g., in the collect() method of the HitCollector in IndexSearcher.search()). If the top scoring hit does not have the highest baseScore, then its score could be less than its coord; this seems desirable. These formulas imply that no result's score can be larger than its coord, so if coord is well defined (always between 0 and 1) then all results will be normalized between 0 and 1.
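
> > > As a sketch of where each quantity would live (illustration only, not working code: termQueries is just the array of the n clauses, matches(), idf(), tf() and lengthNorm() stand in for the corresponding Similarity/Scorer machinery, and maxBaseScore would be tracked by the HitCollector):

> > >   // per-document work, conceptually done in the BooleanQuery Scorer;
> > >   // unmatched terms contribute nothing since their tf is 0
> > >   float baseScore = 0.0f;
> > >   int matchedTerms = 0;
> > >   for (int i = 0; i < termQueries.length; i++) {
> > >     TermQuery tq = termQueries[i];
> > >     if (matches(tq, doc)) {
> > >       baseScore += tq.getBoost() * idf(tq) * tf(tq, doc) * lengthNorm(tq, doc);
> > >       matchedTerms++;
> > >     }
> > >   }
> > >   float coord = (float) matchedTerms / termQueries.length;  // m/n in the simple case
> > >   float rawScore = coord * baseScore;                       // what the hits are ranked by

> > >   // later, once the whole result set has been collected (e.g. in Hits):
> > >   //   score = rawScore / maxBaseScore;   // the best hit then scores its coord, i.e. m/n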

> > > In general the netCoord, which takes the place of coord in the simple case above, needs to be defined for any query. Conceptually, this should be the total percentage of query terms matched by the document. It must be recursively computable from the subquery structure and overridable for specific Query types (e.g., to support specialized coords, like one that is always 1.0, as is useful in multi-field searching). Suitable default definitions for TermQuery and BooleanQuery are:

> > >   TermQuery.netCoord = 1.0 if the term matches, 0.0 otherwise

> > >   BooleanQuery(c1 ... cn).netCoord = sum(ci) coord(1, n) * ci.netCoord

> > > This is not quite the percentage of terms matched; e.g., consider a BooleanQuery with two clauses, one of which is a BooleanQuery of 3 terms and the other of which is just a term. However, it doesn't seem unreasonable for a BooleanQuery to state that its clauses are equally important, and this is consistent with the current coord behavior. BooleanQuery.netCoord could be overridden for special cases, like the pure disjunction I use in my app for field expansions:

> > >   MaxDisjunctionQuery(c1 ... cn).netCoord = max(ci) ci.netCoord

> > > Implementing this would proceed along these lines:
> > >   1. For backwards compatibility, add some kind of newScoring boolean setting.
> > >   2. Update all of these places to behave as indicated if newScoring:
> > >      a. Change Query.weight() to not do any normalization (no call to sumOfSquaredWeights(), Similarity.queryNorm() or normalize()).
> > >      b. Update all Query.weight classes to set their value according to the terms in the score formula above that don't involve the document (e.g., boost*idf for TermQuery).
> > >      c. Add suitable netCoord definitions to all Scorer classes (a sketch follows below).
> > >      d. Update all Scorer.score() methods to compute the rawScore as above.
> > >      e. Add the maximum baseScore as a field kept on TopDocs, computed in the HitCollectors.
> > >      f. Change the normalization in Hits to always divide every raw score by the maximum baseScore.
> > >      g. Update all of the current explain() methods to be consistent with this scoring, and to either report the rawScore, or to report the final score if the normalization factor is provided.
> > >      h. Add Hits.explain() (or better, extend Searcher so that it keeps the Hits for use in Searcher.explain()) to call the new explain variation with the normalization factor, so that final scores are fully explained.

> > > If this seems like a good idea, please let me know. I'm sure there are details I've missed that would come out during coding and testing. Also, the value of this depends on how reasonable the final scores look, which is hard to tell for sure until it is working.

> > > The coding standards for Lucene seem reasonably clear from the source code I've read. I could use just simple Java, so there shouldn't be any significant JVM dependencies. The above should be fully backward compatible due to the newScoring flag.
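
> > > For item c, roughly what I have in mind for the default netCoord definitions (a sketch only -- netCoord() is not an existing method, and subScorers/matchedCurrentDoc are hand-waved; coord(1, n) is the existing Similarity.coord()):

> > >   // in TermScorer: 1.0 if this term matched the current doc, else 0.0
> > >   public float netCoord() {
> > >     return matchedCurrentDoc ? 1.0f : 0.0f;
> > >   }

> > >   // in the BooleanQuery scorer: each clause weighted equally via coord(1, n)
> > >   public float netCoord() {
> > >     float sum = 0.0f;
> > >     for (int i = 0; i < subScorers.length; i++)
> > >       sum += getSimilarity().coord(1, subScorers.length) * subScorers[i].netCoord();
> > >     return sum;
> > >   }

> > >   // in a MaxDisjunctionQuery scorer: a pure disjunction takes the best clause
> > >   public float netCoord() {
> > >     float max = 0.0f;
> > >     for (int i = 0; i < subScorers.length; i++)
> > >       max = Math.max(max, subScorers[i].netCoord());
> > >     return max;
> > >   }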

> > > On another note, I had to remove the German analyzer from my current 1.4.2 source configuration because GermanStemmer failed to compile, due to what are apparently Unicode character constants that I've now got as illegal two-character character constants. This is presumably an encoding problem somewhere that I could track down. It's not important, but if the answer is obvious to any of you, I'd appreciate the quick tip.

> > > Thanks,

> > > Chuck

> > > > -----Original Message-----
> > > > From: Doug Cutting [mailto:[EMAIL PROTECTED]
> > > > Sent: Monday, October 18, 2004 9:44 AM
> > > > To: Lucene Developers List
> > > > Subject: Re: idf and explain(), was Re: Search and Scoring

> > > > Chuck Williams wrote:
> > > > > That's a good point on how the standard vector space inner product similarity measure does imply that the idf is squared relative to the document tf. Even having been aware of this formula for a long time, this particular implication never occurred to me. Do you know if anybody has done precision/recall or other relevancy empirical measurements comparing this vs. a model that does not square idf?

> > > > No, not that I know of.

> > > > > Regarding normalization, the normalization in Hits does not have very nice properties. Due to the > 1.0 threshold check, it loses information, and it arbitrarily defines the highest scoring result in any list that generates scores above 1.0 as a perfect match. It would be nice if score values were meaningful independent of searches, e.g., if "0.6" meant the same quality of retrieval independent of what search was done. This would allow, for example, sites to use a simple quality threshold to only show results that were "good enough". At my last company (I was President and head of engineering for InQuira), we found this to be important to many customers.

> > > > If this is a big issue for you, as it seems it is, please submit a patch to optionally disable score normalization in Hits.java.

> > > > > The standard vector space similarity measure includes normalization by the product of the norms of the vectors, i.e.:

> > > > >   score(d,q) = [ sum(t) weight(t,q) * weight(t,d) ] / sqrt( [sum(t) weight(t,q)^2] * [sum(t) weight(t,d)^2] )

> > > > > This makes the score a cosine, which, since the values are all positive, forces it to be in [0, 1]. The sumOfSquares() normalization in Lucene does not fully implement this. Is there a specific reason for that?

> > > > The quantity 'sum(t) weight(t,d)^2' must be recomputed for each document each time a document is added to the collection, since 'weight(t,d)' is dependent on global term statistics. This is prohibitively expensive. Research has also demonstrated that such cosine normalization gives somewhat inferior results (e.g., Singhal's pivoted length normalization).

> > > > > Re. explain(), I don't see a downside to extending it to show the final normalization in Hits. It could still show the raw score just prior to that normalization.

> > > > In order to normalize scores to 1.0 one must know the maximum score. Explain only computes the score for a single document, and the maximum score is not known.

> > > > > Although I think it would be best to have a normalization that would render scores comparable across searches.

> > > > Then please submit a patch. Lucene doesn't change on its own.

> > > > Doug