Re: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

Christoph Goller Thu, 27 Jan 2005 10:16:38 -0800

Chuck Williams wrote:

Christoph Goller writes:
  > My intention was to (ab-)use query boosts for idf transmission and to
  > overwrite Similarity so that local idf is ignored. The idea was to
  > simply multiply global idf into the given boost. Unfortunately idf is
  > not only used with the boosts and query normalization. It also occurs
  > in the document part of the scoring algorithm. If you look into
  > TermWeight.normalize(float queryNorm) there is an additional
  > multiplication with idf. The same holds for PhraseWeight. So my idea
  > probably does not work :-(

This is not a problem, for at least one reason and, I would argue, two:
  1.  The idf factor in the document part of the scoring algorithm is
precisely the same quantity as that in the query part.  I.e., for every
term in the query, idf^2 is multiplied into the score.  Moving this factor
into the boost associated with the term, instead of the weight, can be made
consistent with current scoring by simply squaring idf.


Unfortunately not. One idf together with the query boost is used for
normalization based on the query norm. This is the idf belonging to the
query. The other idf belongs to the document vector and therefore does not
go into the normalization.
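
To make the difference concrete, here is a small sketch in plain Java (no
Lucene classes). It assumes the weight of each term is computed as
queryNorm * (idf * boost) * idf, with queryNorm = 1 / sqrt(sum of
(idf * boost)^2), which is my reading of TermWeight and the default
Similarity. The class and method names are made up for illustration, and the
"folded" variant is only an assumed model of the workaround (idf reported as
1, idf^2 multiplied into the boost).

```java
// Sketch only: plain Java, no Lucene classes; all names are hypothetical.
final class IdfFoldingSketch {

    // Assumed query norm: 1 / sqrt(sum over terms of (idf * boost)^2).
    // Only the query-side idf (together with the boost) enters here.
    static double queryNorm(double[] idf, double[] boost) {
        double sum = 0.0;
        for (int i = 0; i < idf.length; i++) {
            double queryWeight = idf[i] * boost[i];
            sum += queryWeight * queryWeight;
        }
        return 1.0 / Math.sqrt(sum);
    }

    // Per-term weight: normalized query part times the document-side idf.
    static double[] termValues(double[] idf, double[] boost) {
        double norm = queryNorm(idf, boost);
        double[] values = new double[idf.length];
        for (int i = 0; i < idf.length; i++) {
            double queryWeight = idf[i] * boost[i] * norm; // normalized
            values[i] = queryWeight * idf[i];              // not normalized
        }
        return values;
    }

    // Assumed model of the workaround: Similarity reports idf = 1 and
    // idf^2 is multiplied into the boost instead.
    static double[] foldedTermValues(double[] idf, double[] boost) {
        double[] ones = new double[idf.length];
        double[] foldedBoost = new double[idf.length];
        for (int i = 0; i < idf.length; i++) {
            ones[i] = 1.0;
            foldedBoost[i] = boost[i] * idf[i] * idf[i];
        }
        return termValues(ones, foldedBoost);
    }
}
```

In this toy model both variants give the same ratio between the terms of one
query, but the absolute values differ, because the query norm is computed
from idf * boost in one case and from idf^2 * boost in the other.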

In the current scoring algorithm if we have a simple TermQuery,
(query) normalization will factor out query boost and idf coming from the
query vector. However, the idf from the document vector will remain. This
means that two terms with the same tf in a document will get different
scores if their idf differs.
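
The single-term case can be sketched as follows. This is plain Java, not
Lucene code. The method name and parameters are hypothetical, tfFactor stands
for whatever Similarity.tf() returns, and the formula is my reading of
TermWeight plus the default query norm. It assumes idf and boost are both
positive.

```java
// Sketch only: models the score of a single-term query outside Lucene.
final class SingleTermScoreSketch {

    static double score(double tfFactor, double idf, double boost,
                        double fieldNorm) {
        // Query part: idf * boost, as in sumOfSquaredWeights().
        double queryWeight = idf * boost;

        // With one term, the assumed norm is 1 / sqrt((idf * boost)^2).
        double queryNorm = 1.0 / Math.sqrt(queryWeight * queryWeight);

        // Normalization cancels the query-side idf and the boost,
        // leaving (up to rounding) 1.0.
        queryWeight *= queryNorm;

        // Document part: the second idf is multiplied in after
        // normalization and therefore survives.
        double value = queryWeight * idf;

        return tfFactor * value * fieldNorm;
    }
}
```

Under these assumptions the result reduces to tfFactor * idf * fieldNorm: the
boost drops out, and two terms with the same tf but different idf still get
different scores.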

  2.  Squaring idf is almost certainly a bad idea anyway.  There was a
separate thread on this topic a while back.  In short: although Salton
squared idf in his original vector space formula, even he later dropped one
idf term on empirical grounds; the best scoring algorithms today do not
square idf; and the recent theoretical justifications for tf*idf
demonstrate why idf should not be squared.  These topics are elaborated
fully in the earlier thread.


I remember this discussion. I also took part a little bit :-)
You may be right. But I am not completely convinced. I think
this should be decided based on the proposed benchmark evaluation.

Christoph

