You might want to see a post I just made to the thread with this long subject:
"single field code ready - Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?"
I've done an example page that compares results of searching with different query parsers and Similarities.
Kelvin Tan wrote:
Wouldn't it be great if we could form a study group of Lucene folks who want to take the "next step"? I feel uneasy posting non-Lucene-specific questions to dev or user even if it's related to IR.
Feels to me like there could be a couple of people like us, who didn't do a dissertation in IR, but would like more in-depth knowledge for practical purposes. Basically, the end result is that we are able to tune or extend Lucene by using the expert API (classes marked as Expert). Perhaps a possible outcome is a tuning tutorial for advanced users who already know how to use Lucene.
What do you think?
k
On Sat, 5 Feb 2005 22:10:26 -0800 (PST), Otis Gospodnetic wrote:
Exactly. Luckily, since then I've learned a bit from lucene-dev discussions and side IR readings, so some of the topics are making more sense now.
Otis
--- Kelvin Tan <[EMAIL PROTECTED]> wrote:
Hi Otis, I was re-reading this whole theoretical thread about idf, scoring, normalization, etc. from last Oct and couldn't help laughing out loud when I read your post, coz it summed up what I was thinking the whole time. I think it's really great to have people like Chuck and Paul (Elschot) around. I'm learning so much.
k
On Thu, 21 Oct 2004 10:05:51 -0700 (PDT), Otis Gospodnetic wrote:
Hi Chuck,
The relative lack of responses is not because there is no interest, but probably because there are only a few people on lucene-dev who can follow/understand every detail of your proposal. I understand and hear you, but I have a hard time 'visualizing' some of the formulas in your proposal. What you are saying sounds right to me, but I don't have enough theoretical knowledge to go one way or the other.
Otis
--- Chuck Williams <[EMAIL PROTECTED]> wrote:
Hi everybody,
Although there doesn't seem to be much interest in this, I have one significant improvement to the proposal below and a couple of observations that clarify the situation.
To illustrate the problem better normalization is intended to address: in my current application, for BooleanQuery's of two terms, I frequently get a top score of 1.0 when only one of the terms is matched. 1.0 should indicate a "perfect match". I'd like to set my UI up to present the results differently depending on how good the different results are (e.g., showing a visual indication of result quality, dropping the really bad results entirely, and segregating the good results from other only vaguely relevant results). To build this kind of "intelligence" into the UI, I certainly need to know whether my top result matched all query terms, or only half of them. I'd like to have the score tell me directly how good the matches are. The current normalization does not achieve this.
The intent of a new normalization scheme is to preserve the current scoring in the following sense: the ratio of the nth result's score to the best result's score remains the same. Therefore, the only question is what normalization factor to apply to all scores. This reduces to a very specific question that determines the entire normalization -- what should the score of the best matching result be?
The mechanism below has this property, i.e. it keeps the current score ratios, except that I removed one idf term for reasons covered earlier (this better reflects the current empirically best scoring algorithms). However, removing an idf is a completely separate issue. The improved normalization is independent of whether or not that change is made.
For the central question of what the top score should be, upon reflection, I don't like the definition below. It defined the top score as (approximately) the percentage of query terms matched by the top scoring result. A better conceptual definition is to use a weighted average based on the boosts, i.e., propagate all boosts downward to the underlying terms (or phrases). Specifically, the "net boost" of a term is the product of the direct boost of the term and all boosts applied to encompassing clauses. Then the score of the top result becomes the sum of the net boosts of its matching terms divided by the sum of the net boosts of all query terms.
This definition is a refinement of the original proposal below, and it could probably be further refined if some impact of the tf, idf and/or lengthNorm were desired in determining the top score. These other factors seem to be harder to normalize for, although I've thought of some simple approaches; e.g., assume the unmatched terms in the top result have values for these three factors that are the average of the values of the matched terms, then apply exactly the same concept of actual score divided by theoretical maximum score. That would eliminate any need to maintain maximum value statistics in the index.
As an example of the simple boost-based normalization, for the query ((a^2 b)^3 (c d^2)) the net boosts are:

    a --> 6
    b --> 3
    c --> 1
    d --> 2
So if a and b matched, but not c and d, in the top scoring result, its score would be (6+3)/(6+3+1+2) = 0.75. The normalizer would be 0.75/(current score except for the current normalization). This normalizer would be applied to all current scores (minus normalization) to create the normalized scores.
For simple query (a b), if only one of the terms matched in the top result, then its score would be 0.5, vs. 1.0 or many other possible scores today.
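To make the boost propagation concrete, here is a rough Java sketch of the net-boost computation. QueryNode is an invented stand-in for illustration, not a Lucene class:

    import java.util.List;
    import java.util.Map;

    // Hypothetical sketch: propagate boosts down a query tree so each leaf
    // term ends up with the product of its own boost and all enclosing boosts.
    class QueryNode {
        String term;              // non-null for a leaf term
        float boost = 1.0f;
        List<QueryNode> clauses;  // non-null for a boolean node

        static void netBoosts(QueryNode node, float inherited, Map<String, Float> out) {
            float net = inherited * node.boost;
            if (node.term != null) {
                out.put(node.term, net);      // leaf: record the accumulated product
            } else {
                for (QueryNode c : node.clauses) {
                    netBoosts(c, net, out);   // interior node: push the product down
                }
            }
        }
    }

Walking ((a^2 b)^3 (c d^2)) with this yields a --> 6, b --> 3, c --> 1, d --> 2, matching the example above.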
In addition to enabling more "intelligent" UI's that communicate the quality of results to end-users, the proposal below also extends the explain() mechanism to fully explain the final normalized score. However, that change is also independent -- it could be done with the current scoring.
Am I the only one who would like to see better normalization in Lucene? Does anybody have a better approach?
If you've read this far, thanks for indulging me on this. I would love to see this or something with similar properties in Lucene. I really just want to build my app, but as stated below, I would write and contribute this if there is interest in putting it in and nobody else wants to write it. Please let me know what you think one way or the other.
Thanks,
Chuck
-----Original Message-----
From: Chuck Williams
Sent: Monday, October 18, 2004 7:04 PM
To: 'Lucene Developers List'
Subject: RE: idf and explain(), was Re: Search and Scoring
Doug Cutting wrote:
If this is a big issue for you, as it seems it is, please submit a patch to optionally disable score normalization in Hits.java.
and:
The quantity 'sum(t) weight(t,d)^2' must be recomputed for each document each time a document is added to the collection, since 'weight(t,d)' is dependent on global term statistics. This is prohibitively expensive. Research has also demonstrated that such cosine normalization gives somewhat inferior results (e.g., Singhal's pivoted length normalization).
I'm willing to write, test and contribute code to address the normalization issue, i.e. to yield scores in [0, 1] that are meaningful across searches. Unfortunately, this is considerably more involved than just optionally eliminating the current normalization in Hits. Before undertaking this, I'd like to see if there is agreement in principle that this is a good idea, and that my specific proposal below is the right way to go. Also, I'd like to make sure I've correctly inferred the constraints on writing code to be incorporated into Lucene.
After looking at this in more detail I agree that the cosine normalization is not the way to go, because of both efficiency and effectiveness considerations. A conceptual approach that would be efficient, relatively easy to implement, and seems to have at least reasonable behavior would be to define the top scoring match to have a score roughly equal to the percentage of query terms it matches (its "netCoord"). Scores below the top hit would be reduced based on the ratio of their raw scores to the raw score of the top hit (considering all of the current score factors, except that I'd like to remove one of the idf factors, as discussed earlier).
For a couple of simple cases: a) the top match for a single term query would always have a score of 1.0; b) the top scoring match for a BooleanQuery using DefaultSimilarity with all non-prohibited TermQuery clauses would have a score of m/n, where the hit matches m of the n terms.
This isn't optimal, but seems much better than the current situation. Consider two single-term queries, s and t. If s matches more strongly than t in its top hit (e.g., a higher tf in a shorter field), it would be best if the top score of s were greater than the top score of t.
But this kind of normalization would require keeping additional statistics that so far as I know are not currently in the index, like the maximum tf for terms and the minimum length for fields. These could be expensive to update on deletes. Also, normalizing by such factors could yield lower than subjectively reasonable scores in most cases, so it's not clear it would be better.
The semantics above are at least clean, easy to understand, and support what seems to me the most important motivation to do this: allowing an application to use simple thresholding to segregate likely-to-be-relevant hits from likely-to-be-irrelevant hits.
More specifically, for a BooleanQuery of TermQuery's the resulting score functions would be:

    BooleanQuery of TermQuery's: sbq = (tq1 ... tqn)

    baseScore(sbq, doc) = sum(tqi) boost(tqi) * idf(tqi.term) * tf(tqi.term, doc) * lengthNorm(tqi.term.field, doc)
    rawScore(sbq, doc)  = coord(sbq, doc) * baseScore(sbq, doc)
    norm(sbq, hits)     = 1 / max(hit in hits) baseScore(sbq, hit)
    score(sbq, doc)     = rawScore(sbq, doc) * norm(sbq, hits)
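To make the arithmetic concrete, here is a rough Java sketch of these formulas. The per-term factors are plain numeric inputs here, not Lucene API calls, and coord is taken as m/n for m of n terms matched:

    // Hypothetical sketch of the proposed scoring, not actual Lucene code.
    class ProposedScoring {
        // baseScore = sum over terms of boost * idf * tf * lengthNorm
        // (tf is 0 for unmatched terms, so they contribute nothing)
        static double baseScore(double[] boost, double[] idf, double[] tf, double[] lengthNorm) {
            double sum = 0.0;
            for (int i = 0; i < boost.length; i++) {
                sum += boost[i] * idf[i] * tf[i] * lengthNorm[i];
            }
            return sum;
        }

        // rawScore = coord * baseScore, with coord = m/n
        static double rawScore(int matched, int numTerms, double baseScore) {
            return ((double) matched / numTerms) * baseScore;
        }

        // final score: every rawScore is divided by the maximum baseScore in the result set
        static double score(double rawScore, double maxBaseScore) {
            return rawScore / maxBaseScore;
        }
    }

Since rawScore = coord * baseScore and baseScore <= maxBaseScore, the final score can never exceed the hit's coord, which is the property noted below.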
rawScore's can be computed in the Scorer.score() methods and therefore used to sort the hits (e.g., in the instance method for collect() in the HitCollector in IndexSearcher.search()). If the top scoring hit does not have the highest baseScore, then its score could be less than its coord; this seems desirable. These formulas imply that no result score can be larger than its coord, so if coord is well-defined (always between 0 and 1) then all results will be normalized between 0 and 1.
In general, the netCoord, which takes the place of coord in the simple case above, needs to be defined for any query. Conceptually, this should be the total percentage of query terms matched by the document. It must be recursively computable from the subquery structure and overridable for specific Query types (e.g., to support specialized coords, like one that is always 1.0, as is useful in multi-field searching). Suitable default definitions for TermQuery and BooleanQuery are:
    TermQuery.netCoord = 1.0 if term matches, 0.0 otherwise
    BooleanQuery(c1 ... cn).netCoord = sum(ci) coord(1, n) * ci.netCoord
This is not quite percentage of terms matched; e.g., consider a BooleanQuery with two clauses, one of which is a BooleanQuery of 3 terms and the other of which is just a term. However, it doesn't seem unreasonable for a BooleanQuery to state that its clauses are equally important, and this is consistent with the current coord behavior. BooleanQuery.netCoord could be overridden for special cases, like the pure disjunction I use in my app for field expansions:

    MaxDisjunctionQuery(c1 .. cn).netCoord = max(ci) ci.netCoord
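A rough sketch of the recursive computation (the class names here are invented for illustration, not Lucene classes):

    // Hypothetical sketch of recursive netCoord over a query tree.
    abstract class NetCoordNode {
        abstract float netCoord();
    }

    class TermNode extends NetCoordNode {
        boolean matched;
        float netCoord() { return matched ? 1.0f : 0.0f; }
    }

    class BooleanNode extends NetCoordNode {
        NetCoordNode[] clauses;
        float netCoord() {              // clauses weighted equally: coord(1, n) = 1/n
            float sum = 0.0f;
            for (int i = 0; i < clauses.length; i++) sum += clauses[i].netCoord();
            return sum / clauses.length;
        }
    }

    class MaxDisjunctionNode extends NetCoordNode {
        NetCoordNode[] clauses;
        float netCoord() {              // pure disjunction: the best clause wins
            float max = 0.0f;
            for (int i = 0; i < clauses.length; i++) max = Math.max(max, clauses[i].netCoord());
            return max;
        }
    }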
Implementing this would proceed along these lines:

1. For backwards compatibility, add some kind of newScoring boolean setting.

2. Update all of these places to behave as indicated if newScoring:

   a. Change Query.weight() to not do any normalization (no call to sumOfSquaredWeights(), Similarity.queryNorm() or normalize()).
   b. Update all Query.weight classes to set their value according to the terms in the score formula above that don't involve the document (e.g., boost*idf for TermQuery).
   c. Add suitable netCoord definitions to all Scorer classes.
   d. Update all Scorer.score() methods to compute the rawScore as above.
   e. Add the maximum baseScore as a field kept on TopDocs, computed in the HitCollector's.
   f. Change the normalization in Hits to always divide every raw score by the maximum baseScore (a rough sketch of this step follows the list).
   g. Update all of the current explain() methods to be consistent with this scoring, and to either report the rawScore, or to report the final score if the normalization factor is provided.
   h. Add Hits.explain() (or better, extend Searcher so that it keeps the Hits for use in Searcher.explain()) to call the new explain variation with the normalization factor so that final scores are fully explained.
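The normalization in step (f) is trivial once the maximum baseScore is available; roughly (none of these names are real Lucene API, just a sketch):

    // Hypothetical sketch of step (f): always divide by the maximum baseScore.
    class NewScoringNormalizer {
        static float[] normalize(float[] rawScores, float maxBaseScore) {
            float[] scores = new float[rawScores.length];
            for (int i = 0; i < rawScores.length; i++) {
                scores[i] = rawScores[i] / maxBaseScore; // bounded by coord, hence in [0, 1]
            }
            return scores;
        }
    }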
If this seems like a good idea, please let me know. I'm sure there are details I've missed that would come out during coding and testing. Also, the value of this is dependent on how reasonable the final scores look, which is hard to tell for sure until it is working.
The coding standards for Lucene seem reasonably clear from the source code I've read. I could use just simple Java so there shouldn't be any significant JVM dependencies. The above should be fully backward compatible due to the newScoring flag.
On another note, I had to remove the German analyzer in my current 1.4.2 source configuration because GermanStemmer failed to compile due to what are apparently Unicode character constants that I've now got as illegal two-character character constants. This is presumably an encoding problem somewhere that I could track down. It's not important, but if the answer is obvious to any of you, I'd appreciate the quick tip.
Thanks,
Chuck
-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Monday, October 18, 2004 9:44 AM
To: Lucene Developers List
Subject: Re: idf and explain(), was Re: Search and Scoring
Chuck Williams wrote:
That's a good point on how the standard vector space inner product similarity measure does imply that the idf is squared relative to the document tf. Even having been aware of this formula for a long time, this particular implication never occurred to me. Do you know if anybody has done precision/recall or other empirical relevancy measurements comparing this vs. a model that does not square idf?
No, not that I know of.
Regarding normalization, the normalization in Hits does not have very nice properties. Due to the > 1.0 threshold check, it loses information, and it arbitrarily defines the highest scoring result in any list that generates scores above 1.0 as a perfect match. It would be nice if score values were meaningful independent of searches, e.g., if "0.6" meant the same quality of retrieval independent of what search was done. This would allow, for example, sites to use a simple quality threshold to only show results that were "good enough". At my last company (I was President and head of engineering for InQuira), we found this to be important to many customers.
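(To be concrete, the behavior I'm objecting to is roughly the following, paraphrased from memory of Hits.java rather than quoted verbatim:

    // Rough paraphrase of the current Hits normalization: scaling happens
    // only when the top raw score exceeds 1.0, and then the top hit is
    // declared a "perfect match" with score exactly 1.0.
    float scoreNorm = 1.0f;
    if (topScore > 1.0f) {
        scoreNorm = 1.0f / topScore;
    }
    float normalizedScore = rawScore * scoreNorm; // scores <= 1.0 pass through unchanged

So a search whose top raw score is 0.3 reports 0.3, while one whose top raw score is 30 reports 1.0; the scores aren't comparable across searches.)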
If this is a big issue for you, as it seems it is, please submit a patch to optionally disable score normalization in Hits.java.
The standard vector space similarity measure includes normalization by the product of the norms of the vectors, i.e.:

    score(d,q) = sum over t of ( weight(t,q) * weight(t,d) )
                 / sqrt [ (sum(t) weight(t,q)^2) * (sum(t) weight(t,d)^2) ]
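The same formula in LaTeX notation, for anyone who finds that easier to read:

    score(d,q) = \frac{\sum_t \mathrm{weight}(t,q)\,\mathrm{weight}(t,d)}
                      {\sqrt{\left(\sum_t \mathrm{weight}(t,q)^2\right)\left(\sum_t \mathrm{weight}(t,d)^2\right)}}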
This makes the score a cosine, which, since the values are all positive, forces it to be in [0, 1]. The sumOfSquares() normalization in Lucene does not fully implement this. Is there a specific reason for that?
The quantity 'sum(t) weight(t,d)^2' must be recomputed for each document each time a document is added to the collection, since 'weight(t,d)' is dependent on global term statistics. This is prohibitively expensive. Research has also demonstrated that such cosine normalization gives somewhat inferior results (e.g., Singhal's pivoted length normalization).
Re. explain(), I don't see a downside to extending it to show the final normalization in Hits. It could still show the raw score just prior to that normalization.
In order to normalize scores to 1.0 one must know the maximum score. Explain only computes the score for a single document, and the maximum score is not known.
Although I think it would be best to have a normalization that would render scores comparable across searches.
Then please submit a patch. Lucene doesn't change on its own.
Doug