That's a good point about how the standard vector space inner product similarity measure implies that idf is squared relative to the document tf. Even though I have been aware of this formula for a long time, this particular implication never occurred to me. Do you know if anybody has done precision/recall or other empirical relevance measurements comparing this against a model that does not square idf?
No, not that I know of.
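If someone wanted to measure it, here is a rough, untested sketch of one way to run the experiment: since idf() contributes to both the query-side and the document-side weight, a Similarity whose idf() returns the square root of the standard value leaves the product with idf to the first power only. The class name is made up for illustration, and it assumes the DefaultSimilarity idf(docFreq, numDocs) signature:

    import org.apache.lucene.search.DefaultSimilarity;

    // Untested sketch: return sqrt(idf) so that the query-side and
    // document-side factors multiply out to idf^1 rather than idf^2.
    public class SingleIdfSimilarity extends DefaultSimilarity {
      public float idf(int docFreq, int numDocs) {
        return (float) Math.sqrt(super.idf(docFreq, numDocs));
      }
    }

Setting it on the searcher (searcher.setSimilarity(new SingleIdfSimilarity())) and re-running the same queries against the same index would then let a precision/recall comparison isolate the effect of squaring idf.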
Regarding normalization, the normalization in Hits does not have very nice properties. Because of the > 1.0 threshold check, it loses information, and it arbitrarily defines the highest-scoring result in any list that produces scores above 1.0 as a perfect match. It would be nice if score values were meaningful independent of searches, e.g., if "0.6" meant the same quality of retrieval regardless of what search was done. This would allow, for example, sites to use a simple quality threshold to show only results that were "good enough". At my last company (I was President and head of engineering for InQuira), we found this to be important to many customers.
If this is a big issue for you, as it seems it is, please submit a patch to optionally disable score normalization in Hits.java.
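In the meantime, raw, un-normalized scores are already available through the HitCollector API, so an absolute cutoff can be applied without patching Hits (though raw scores are still not comparable across different queries). A rough, untested sketch, where 'searcher', 'query', and the 0.6 cutoff are placeholders:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.search.HitCollector;

    // Assumes an existing IndexSearcher 'searcher' and Query 'query'.
    final float threshold = 0.6f;          // illustrative cutoff only
    final List passing = new ArrayList();  // ids of docs that clear the cutoff
    searcher.search(query, new HitCollector() {
      public void collect(int doc, float score) {
        // 'score' here is the raw score, before Hits' normalization
        if (score >= threshold) {
          passing.add(new Integer(doc));
        }
      }
    });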
The standard vector space similarity measure includes normalization by the product of the norms of the vectors, i.e.:
score(d,q) = sum(t) ( weight(t,q) * weight(t,d) ) / sqrt( (sum(t) weight(t,q)^2) * (sum(t) weight(t,d)^2) )
This makes the score a cosine, which, since the values are all positive, forces it into [0, 1]. The sumOfSquares() normalization in Lucene does not fully implement this. Is there a specific reason for that?
The quantity 'sum(t) weight(t,d)^2' must be recomputed for every document each time a document is added to the collection, since 'weight(t,d)' depends on global term statistics (idf), which change as the collection changes. This is prohibitively expensive. Research has also demonstrated that such cosine normalization gives somewhat inferior results (e.g., Singhal's work on pivoted length normalization).
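To make that dependency concrete, here is an illustration (plain Java over term-to-tf maps, not Lucene code) of full cosine scoring with the usual log-based idf; the document norm in the denominator is built from idf, so any added document that changes a docFreq or numDocs invalidates every previously computed document norm involving those statistics:

    import java.util.HashSet;
    import java.util.Iterator;
    import java.util.Map;
    import java.util.Set;

    // Illustration only.  'queryTf' and 'docTf' map term (String) -> raw tf
    // (Integer); 'docFreq' maps every term -> document frequency (Integer);
    // 'numDocs' is the collection size.  weight(t,x) = tf(t,x) * idf(t).
    static float cosine(Map queryTf, Map docTf, Map docFreq, int numDocs) {
      double dot = 0.0, qNorm = 0.0, dNorm = 0.0;
      Set terms = new HashSet(queryTf.keySet());
      terms.addAll(docTf.keySet());
      for (Iterator i = terms.iterator(); i.hasNext();) {
        String t = (String) i.next();
        int df = ((Integer) docFreq.get(t)).intValue();
        double idf = Math.log((double) numDocs / (df + 1)) + 1.0;
        double wq = queryTf.containsKey(t)
            ? ((Integer) queryTf.get(t)).intValue() * idf : 0.0;
        double wd = docTf.containsKey(t)
            ? ((Integer) docTf.get(t)).intValue() * idf : 0.0;
        dot += wq * wd;      // numerator:   sum(t) weight(t,q) * weight(t,d)
        qNorm += wq * wq;    // denominator: sum(t) weight(t,q)^2
        dNorm += wd * wd;    // denominator: sum(t) weight(t,d)^2
      }
      return (float) (dot / Math.sqrt(qNorm * dNorm));
    }

Lucene instead stores a per-document length norm at index time, which does not depend on idf and so does not have to be recomputed as the collection grows.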
Re. explain(), I don't see a downside to extending it to show the final normalization in Hits. It could still show the raw score just prior to that normalization.
In order to normalize scores to 1.0 one must know the maximum score. Explain only computes the score for a single document, and the maximum score is not known.
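If someone really needs an explanation value on the same scale Hits reports, a rough workaround (untested; 'searcher', 'query', and 'docId' are placeholders) is to fetch the top raw score with a separate search and apply the same > 1.0 rule that Hits does:

    import org.apache.lucene.search.Explanation;
    import org.apache.lucene.search.TopDocs;

    // Assumes an existing IndexSearcher 'searcher' and Query 'query'.
    TopDocs top = searcher.search(query, null, 1);      // top raw score
    float maxScore = top.scoreDocs.length > 0 ? top.scoreDocs[0].score : 1.0f;
    float scoreNorm = maxScore > 1.0f ? 1.0f / maxScore : 1.0f;

    Explanation exp = searcher.explain(query, docId);   // raw explanation
    float hitsStyleScore = exp.getValue() * scoreNorm;  // Hits-style value

This costs an extra search, and the resulting values are still only comparable within a single query.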
> Although I think it would be best to have a normalization that would render scores comparable across searches.
Then please submit a patch. Lucene doesn't change on its own.
Doug