You might want to see a post I just made to the thread with this long subject:
"single field code ready - Re: URL to compare 2 Similarity's ready-- Re: Scoring benchmark evaluation. Was RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?"
I've done an example page that compares results of searching with different query parsers and Similarities.
Kelvin Tan wrote:
Wouldn't it be great if we could form a study group of Lucene folks who want to take the "next step"? I feel uneasy posting non-Lucene-specific questions to dev or user even if it's related to IR.
Feels to me like there could be a couple of people like us, who didn't do a dissertation in IR, but would like more in-depth knowledge for practical purposes. Basically, the end result is that we are able to tune or extend Lucene by using the expert API (classes marked as Expert). Perhaps a possible outcome is a tuning tutorial for advanced users who already know how to use Lucene.
What do you think?
k
On Sat, 5 Feb 2005 22:10:26 -0800 (PST), Otis Gospodnetic wrote:
Exactly. Luckily, since then I've learned a bit from lucene-dev discussions and side IR readings, so some of the topics are making more sense now.
Otis
--- Kelvin Tan <[EMAIL PROTECTED]> wrote:
Hi Otis, I was re-reading this whole theoretical thread about idf, scoring, normalization, etc. from last Oct and couldn't help laughing out loud when I read your post, coz it summed up what I was thinking the whole time. I think it's really great to have people like Chuck and Paul (Elschot) around. I'm learning so much.
k
On Thu, 21 Oct 2004 10:05:51 -0700 (PDT), Otis Gospodnetic wrote:
Hi Chuck,
The relative lack of responses is not because there is no interest, but probably because there are only a few people on lucene-dev who can follow/understand every detail of your proposal. I understand and hear you, but I have a hard time 'visualizing' some of the formulas in your proposal. What you are saying sounds right to me, but I don't have enough theoretical knowledge to go one way or the other.
Otis
--- Chuck Williams <[EMAIL PROTECTED]> wrote:
Hi everybody,
Although there doesn't seem to be much interest in this, I have one significant improvement to the proposal below and a couple of observations that clarify the situation.
To illustrate the problem better normalization is intended to address: in my current application, for BooleanQuery's of two terms, I frequently get a top score of 1.0 when only one of the terms is matched. 1.0 should indicate a "perfect match". I'd like to set my UI up to present the results differently depending on how good the different results are (e.g., showing a visual indication of result quality, dropping the really bad results entirely, and segregating the good results from other only vaguely relevant results). To build this kind of "intelligence" into the UI, I certainly need to know whether my top result matched all query terms, or only half of them. I'd like to have the score tell me directly how good the matches are. The current normalization does not achieve this.
The intent of a new normalization scheme is to preserve the current scoring in the following sense: the ratio of the nth result's score to the best result's score remains the same. Therefore, the only question is what normalization factor to apply to all scores. This reduces to a very specific question that determines the entire normalization -- what should the score of the best matching result be?
The mechanism below has this property, i.e. it keeps the current score ratios, except that I removed one idf term for reasons covered earlier (this better reflects the current empirically best scoring algorithms). However, removing an idf is a completely separate issue. The improved normalization is independent of whether or not that change is made.
For the central question of what the top score should be, upon reflection, I don't like the definition below. It defined the top score as (approximately) the percentage of query terms matched by the top scoring result. A better conceptual definition is to use a weighted average based on the boosts, i.e., propagate all boosts downward to the underlying terms (or phrases). Specifically, the "net boost" of a term is the product of the direct boost of the term and all boosts applied to encompassing clauses. Then the score of the top result becomes the sum of the net boosts of its matching terms divided by the sum of the net boosts of all query terms.
This definition is a refinement of the original proposal below, and it could probably be further refined if some impact of the tf, idf and/or lengthNorm were desired in determining the top score. These other factors seem to be harder to normalize for, although I've thought of some simple approaches; e.g., assume the unmatched terms in the top result have values for these three factors that are the average of the values of the matched terms, then apply exactly the same concept of actual score divided by theoretical maximum score. That would eliminate any need to maintain maximum value statistics in the index.
As an example of the simple boost-based normalization, for the query ((a^2 b)^3 (c d^2)) the net boosts are:

    a --> 6
    b --> 3
    c --> 1
    d --> 2
So if a and b matched, but not c and d, in the top scoring result, its score would be (6+3)/(6+3+1+2) = 0.75. The normalizer would be 0.75/(current score except for the current normalization). This normalizer would be applied to all current scores (minus normalization) to create the normalized scores.
For simple query (a b), if only one of the terms matched in the top result, then its score would be 0.5, vs. 1.0 or many other possible scores today.
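To make the boost propagation concrete, here is a rough Java sketch of the net-boost computation. QueryNode is an invented stand-in for illustration, not a Lucene class:

    import java.util.List;
    import java.util.Map;

    // Hypothetical sketch: propagate boosts down a query tree so each leaf
    // term ends up with the product of its own boost and all enclosing boosts.
    class QueryNode {
        String term;              // non-null for a leaf term
        float boost = 1.0f;
        List<QueryNode> clauses;  // non-null for a boolean node

        static void netBoosts(QueryNode node, float inherited, Map<String, Float> out) {
            float net = inherited * node.boost;
            if (node.term != null) {
                out.put(node.term, net);      // leaf: record the accumulated product
            } else {
                for (QueryNode c : node.clauses) {
                    netBoosts(c, net, out);   // interior node: push the product down
                }
            }
        }
    }

Walking ((a^2 b)^3 (c d^2)) with this yields a --> 6, b --> 3, c --> 1, d --> 2, matching the example above.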
In addition to enabling more "intelligent" UI's that communicate the quality of results to end-users, the proposal below also extends the explain() mechanism to fully explain the final normalized score. However, that change is also independent -- it could be done with the current scoring.
Am I the only one who would like to see better normalization in Lucene? Does anybody have a better approach?
If you've read this far, thanks for indulging me on this. I would love to see this or something with similar properties in Lucene. I really just want to build my app, but as stated below, I would write and contribute this if there is interest in putting it in and nobody else wants to write it. Please let me know what you think one way or the other.
Thanks,
Chuck
-----Original Message-----
From: Chuck Williams
Sent: Monday, October 18, 2004 7:04 PM
To: 'Lucene Developers List'
Subject: RE: idf and explain(), was Re: Search and Scoring
Doug Cutting wrote:
If this is a big issue for you, as it seems it is, please submit a patch to optionally disable score normalization in Hits.java.
and:
The quantity 'sum(t) weight(t,d)^2' must be recomputed for each document each time a document is added to the collection, since 'weight(t,d)' is dependent on global term statistics. This is prohibitively expensive. Research has also demonstrated that such cosine normalization gives somewhat inferior results (e.g., Singhal's pivoted length normalization).
I'm willing to write, test and contribute code to address the normalization issue, i.e. to yield scores in [0, 1] that are meaningful across searches. Unfortunately, this is considerably more involved than just optionally eliminating the current normalization in Hits. Before undertaking this, I'd like to see if there is agreement in principle that this is a good idea, and that my specific proposal below is the right way to go. Also, I'd like to make sure I've correctly inferred the constraints on writing code to be incorporated into Lucene.
After looking at this in more detail I agree that the cosine normalization is not the way to go, because of both efficiency and effectiveness considerations. A conceptual approach that would be efficient, relatively easy to implement, and seems to have at least reasonable behavior would be to define the top scoring match to have a score roughly equal to the percentage of query terms it matches (its "netCoord"). Scores below the top hit would be reduced based on the ratio of their raw scores to the raw score of the top hit (considering all of the current score factors, except that I'd like to remove one of the idf factors, as discussed earlier).
For a couple of simple cases: a) the top match for a single term query would always have a score of 1.0; b) the top scoring match for a BooleanQuery using DefaultSimilarity with all non-prohibited TermQuery clauses would have a score of m/n, where the hit matches m of the n terms.
This isn't optimal, but seems much better than the current situation. Consider two single-term queries, s and t. If s matches more strongly than t in its top hit (e.g., a higher tf in a shorter field), it would be best if the top score of s were greater than the top score of t.
But this kind of normalization would require keeping additional statistics that so far as I know are not currently in the index, like the maximum tf for terms and the minimum length for fields. These could be expensive to update on deletes. Also, normalizing by such factors could yield lower than subjectively reasonable scores in most cases, so it's not clear it would be better.
The semantics above are at least clean, easy to understand, and support what seems to me the most important motivation to do this: allowing an application to use simple thresholding to segregate likely-to-be-relevant hits from likely-to-be-irrelevant hits.
More specifically, for a BooleanQuery of TermQuery's the resulting score functions would be:

    BooleanQuery of TermQuery's: sbq = (tq1 ... tqn)

    baseScore(sbq, doc) = sum(tqi) boost(tqi) * idf(tqi.term) * tf(tqi.term, doc) * lengthNorm(tqi.term.field, doc)
    rawScore(sbq, doc)  = coord(sbq, doc) * baseScore(sbq, doc)
    norm(sbq, hits)     = 1 / max(hit in hits) baseScore(sbq, hit)
    score(sbq, doc)     = rawScore(sbq, doc) * norm(sbq, hits)
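To make the arithmetic concrete, here is a rough Java sketch of these formulas. The per-term factors are plain numeric inputs here, not Lucene API calls, and coord is taken as m/n for m of n terms matched:

    // Hypothetical sketch of the proposed scoring, not actual Lucene code.
    class ProposedScoring {
        // baseScore = sum over terms of boost * idf * tf * lengthNorm
        // (tf is 0 for unmatched terms, so they contribute nothing)
        static double baseScore(double[] boost, double[] idf, double[] tf, double[] lengthNorm) {
            double sum = 0.0;
            for (int i = 0; i < boost.length; i++) {
                sum += boost[i] * idf[i] * tf[i] * lengthNorm[i];
            }
            return sum;
        }

        // rawScore = coord * baseScore, with coord = m/n
        static double rawScore(int matched, int numTerms, double baseScore) {
            return ((double) matched / numTerms) * baseScore;
        }

        // final score: every rawScore is divided by the maximum baseScore in the result set
        static double score(double rawScore, double maxBaseScore) {
            return rawScore / maxBaseScore;
        }
    }

Since rawScore = coord * baseScore and baseScore <= maxBaseScore, the final score can never exceed the hit's coord, which is the property noted below.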
rawScore's can be computed in the Scorer.score() methods and therefore used to sort the hits (e.g., in the instance method for collect() in the HitCollector in IndexSearcher.search()). If the top scoring hit does not have the highest baseScore, then its score could be less than its coord; this seems desirable. These formulas imply that no result score can be larger than its coord, so if coord is well-defined (always between 0 and 1) then all results will be normalized between 0 and 1.
In general, the netCoord, which takes the place of coord in the simple case above, needs to be defined for any query. Conceptually, this should be the total percentage of query terms matched by the document. It must be recursively computable from the subquery structure and overridable for specific Query types (e.g., to support specialized coords, like one that is always 1.0, as is useful in multi-field searching). Suitable default definitions for TermQuery and BooleanQuery are:
    TermQuery.netCoord = 1.0 if term matches, 0.0 otherwise
    BooleanQuery(c1 ... cn).netCoord = sum(ci) coord(1, n) * ci.netCoord
This is not quite percentage of terms matched; e.g., consider a BooleanQuery with two clauses, one of which is a BooleanQuery of 3 terms and the other of which is just a term. However, it doesn't seem unreasonable for a BooleanQuery to state that its clauses are equally important, and this is consistent with the current coord behavior. BooleanQuery.netCoord could be overridden for special cases, like the pure disjunction I use in my app for field expansions:

    MaxDisjunctionQuery(c1 .. cn).netCoord = max(ci) ci.netCoord
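A rough sketch of the recursive computation (the class names here are invented for illustration, not Lucene classes):

    // Hypothetical sketch of recursive netCoord over a query tree.
    abstract class NetCoordNode {
        abstract float netCoord();
    }

    class TermNode extends NetCoordNode {
        boolean matched;
        float netCoord() { return matched ? 1.0f : 0.0f; }
    }

    class BooleanNode extends NetCoordNode {
        NetCoordNode[] clauses;
        float netCoord() {              // clauses weighted equally: coord(1, n) = 1/n
            float sum = 0.0f;
            for (int i = 0; i < clauses.length; i++) sum += clauses[i].netCoord();
            return sum / clauses.length;
        }
    }

    class MaxDisjunctionNode extends NetCoordNode {
        NetCoordNode[] clauses;
        float netCoord() {              // pure disjunction: the best clause wins
            float max = 0.0f;
            for (int i = 0; i < clauses.length; i++) max = Math.max(max, clauses[i].netCoord());
            return max;
        }
    }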
Implementing this would proceed along these lines:

1. For backwards compatibility, add some kind of newScoring boolean setting.

2. Update all of these places to behave as indicated if newScoring:

   a. Change Query.weight() to not do any normalization (no call to sumOfSquaredWeights(), Similarity.queryNorm() or normalize()).
   b. Update all Query.weight classes to set their value according to the terms in the score formula above that don't involve the document (e.g., boost*idf for TermQuery).
   c. Add suitable netCoord definitions to all Scorer classes.
   d. Update all Scorer.score() methods to compute the rawScore as above.
   e. Add the maximum baseScore as a field kept on TopDocs, computed in the HitCollector's.
   f. Change the normalization in Hits to always divide every raw score by the maximum baseScore (a rough sketch of this step follows the list).
   g. Update all of the current explain() methods to be consistent with this scoring, and to either report the rawScore, or to report the final score if the normalization factor is provided.
   h. Add Hits.explain() (or better, extend Searcher so that it keeps the Hits for use in Searcher.explain()) to call the new explain variation with the normalization factor so that final scores are fully explained.
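The normalization in step (f) is trivial once the maximum baseScore is available; roughly (none of these names are real Lucene API, just a sketch):

    // Hypothetical sketch of step (f): always divide by the maximum baseScore.
    class NewScoringNormalizer {
        static float[] normalize(float[] rawScores, float maxBaseScore) {
            float[] scores = new float[rawScores.length];
            for (int i = 0; i < rawScores.length; i++) {
                scores[i] = rawScores[i] / maxBaseScore; // bounded by coord, hence in [0, 1]
            }
            return scores;
        }
    }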
If this seems like a good idea, please let me know. I'm sure there are details I've missed that would come out during coding and testing. Also, the value of this is dependent on how reasonable the final scores look, which is hard to tell for sure until it is working.
The coding standards for Lucene seem reasonably clear from the source code I've read. I could use just simple Java so there shouldn't be any significant JVM dependencies. The above should be fully backward compatible due to the newScoring flag.
On another note, I had to remove the German analyzer in my current 1.4.2 source configuration because GermanStemmer failed to compile due to what are apparently Unicode character constants that I've now got as illegal two-character character constants. This is presumably an encoding problem somewhere that I could track down. It's not important, but if the answer is obvious to any of you, I'd appreciate the quick tip.
Thanks,
Chuck
-----Original Message-----
From: Doug Cutting [mailto:[EMAIL PROTECTED]
Sent: Monday, October 18, 2004 9:44 AM
To: Lucene Developers List
Subject: Re: idf and explain(), was Re: Search and Scoring
Chuck Williams wrote:
That's a good point on how the standard vector space inner product similarity measure does imply that the idf is squared relative to the document tf. Even having been aware of this formula for a long time, this particular implication never occurred to me. Do you know if anybody has done precision/recall or other empirical relevancy measurements comparing this vs. a model that does not square idf?
No, not that I know of.
Regarding normalization, the normalization in Hits does not have very nice properties. Due to the > 1.0 threshold check, it loses information, and it arbitrarily defines the highest scoring result in any list that generates scores above 1.0 as a perfect match. It would be nice if score values were meaningful independent of searches, e.g., if "0.6" meant the same quality of retrieval independent of what search was done. This would allow, for example, sites to use a simple quality threshold to only show results that were "good enough". At my last company (I was President and head of engineering for InQuira), we found this to be important to many customers.
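(To be concrete, the behavior I'm objecting to is roughly the following, paraphrased from memory of Hits.java rather than quoted verbatim:

    // Rough paraphrase of the current Hits normalization: scaling happens
    // only when the top raw score exceeds 1.0, and then the top hit is
    // declared a "perfect match" with score exactly 1.0.
    float scoreNorm = 1.0f;
    if (topScore > 1.0f) {
        scoreNorm = 1.0f / topScore;
    }
    float normalizedScore = rawScore * scoreNorm; // scores <= 1.0 pass through unchanged

So a search whose top raw score is 0.3 reports 0.3, while one whose top raw score is 30 reports 1.0; the scores aren't comparable across searches.)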
If this is a big issue for you, as it seems it is, please submit a patch to optionally disable score normalization in Hits.java.
The standard vector space similarity measure includes normalization by the product of the norms of the vectors, i.e.:

    score(d,q) = sum over t of ( weight(t,q) * weight(t,d) )
                 / sqrt [ (sum(t) weight(t,q)^2) * (sum(t) weight(t,d)^2) ]
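The same formula in LaTeX notation, for anyone who finds that easier to read:

    score(d,q) = \frac{\sum_t \mathrm{weight}(t,q)\,\mathrm{weight}(t,d)}
                      {\sqrt{\left(\sum_t \mathrm{weight}(t,q)^2\right)\left(\sum_t \mathrm{weight}(t,d)^2\right)}}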
This makes the score a cosine, which, since the values are all positive, forces it to be in [0, 1]. The sumOfSquares() normalization in Lucene does not fully implement this. Is there a specific reason for that?
The quantity 'sum(t) weight(t,d)^2' must be recomputed for each document each time a document is added to the collection, since 'weight(t,d)' is dependent on global term statistics. This is prohibitively expensive. Research has also demonstrated that such cosine normalization gives somewhat inferior results (e.g., Singhal's pivoted length normalization).
Re. explain(), I don't see a downside to extending it to show the final normalization in Hits. It could still show the raw score just prior to that normalization.
In order to normalize scores to 1.0 one must know the maximum score. Explain only computes the score for a single document, and the maximum score is not known.
Although I think it would be best to have a normalization that would render scores comparable across searches.
Then please submit a patch. Lucene doesn't change on its own.
Doug