RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

Chuck Williams Thu, 27 Jan 2005 10:50:46 -0800

Christoph Goller writes:
  > Chuck Williams wrote:
  > > Christoph Goller writes:
  > >   > My intention was to (ab-)use query boosts for idf transmission
  > >   > and to overwrite Similarity so that local idf is ignored. The
  > >   > idea was to simply multiply global idf into the given boost.
  > >   > Unfortunately idf is not only used with the boosts and query
  > >   > normalization. It also occurs in the document part of the
  > >   > scoring algorithm. If you look into
  > >   > TermWeight.normalize(float queryNorm) there is an additional
  > >   > multiplication with idf. The same holds for PhraseWeight. So my
  > >   > idea probably does not work :-(
  > >
  > > This is not a problem for at least one reason, and I argue two
  > > reasons:
  > >   1.  The idf factor in the document part of the scoring algorithm
  > > is precisely the same quantity as that in the query part.  I.e.,
  > > for every term in the query, idf^2 is multiplied into the score.
  > > Rewriting this factor into the boost associated with the term
  > > instead of the weight can be made consistent with current scoring
  > > by simply squaring idf.
  > 
  > Unfortunately not. One idf together with the query boost is used for
  > normalization based on the query norm. This is the idf belonging to
  > the query. The other idf belongs to the document vector and therefore
  > does not go into the normalization.


Actually, the normalization contributes a third idf factor (in a
different form: squared, summed, and square-rooted in the denominator).

I.e., for a simple BooleanQuery:

score(query, doc) =
  coord*queryNorm*
    sum[ term in query : 
         idf(term)*boost(term)*idf(term)*tf(term, doc)*docNorm(doc)
       ]

where queryNorm =
  1/sqrt( sum[ term in query : (boost(term)*idf(term))^2 ] )

So, within the sum, only the Scorer factors tf(term, doc) and
docNorm(doc) depend on the doc.  The remaining weight factors
(queryNorm and the per-term boost and idf) depend only on the boosts
and idf's, and so can be computed by MultiSearcher augmented with a
global idf table.
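
A minimal sketch of what such a global idf table could look like,
written in Python rather than against the Lucene API.  The shard
layout (a num_docs count plus a per-term docFreq map) and the function
names are hypothetical; the idf formula is assumed to mirror
DefaultSimilarity, i.e. ln(numDocs/(docFreq+1)) + 1.

```python
import math
from collections import Counter

def idf(doc_freq, num_docs):
    # Assumption: same shape as Lucene's DefaultSimilarity.idf.
    return math.log(num_docs / (doc_freq + 1)) + 1.0

def global_idf_table(shards):
    """shards: list of (num_docs, {term: doc_freq}), one per sub-index."""
    total_docs = sum(num_docs for num_docs, _ in shards)
    total_df = Counter()
    for _, doc_freqs in shards:
        total_df.update(doc_freqs)  # adds the per-shard docFreq counts
    return {term: idf(df, total_docs) for term, df in total_df.items()}

# Two sub-indexes of very different size (made-up numbers).
shards = [
    (1000, {"lucene": 9, "search": 400}),
    (10,   {"lucene": 9, "search": 1}),
]
table = global_idf_table(shards)

# Local idf for "lucene" differs per shard; the table gives one value.
local = [idf(doc_freqs["lucene"], num_docs) for num_docs, doc_freqs in shards]
print(local, table["lucene"])
```

The point of the sketch is only that the same term gets different idf
values per sub-index unless docFreq and numDocs are aggregated first.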

I.e., to be explicit, the queryNorm could also be factored into the
boost if that implementation is desired.  The MultiSearcher boost could
be all terms in the formula above except for tf(term,doc)*docNorm(doc).
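
To make the factoring concrete, here is a small numeric sketch in
Python (not Lucene code).  It assumes queryNorm is 1/sqrt of the summed
squared weights, that the sub-searcher's Similarity returns 1 for idf
and queryNorm, and that coord is applied unchanged; all function names
and numbers are hypothetical.

```python
import math

def query_norm(terms, global_idf, boost):
    # 1 / sqrt( sum over terms of (boost * idf)^2 )
    return 1.0 / math.sqrt(sum((boost[t] * global_idf[t]) ** 2 for t in terms))

def score_direct(terms, global_idf, boost, tf, doc_norm, coord=1.0):
    # The formula as written: query-side idf*boost times doc-side idf*tf*norm.
    qn = query_norm(terms, global_idf, boost)
    return coord * qn * sum(
        global_idf[t] * boost[t] * global_idf[t] * tf[t] * doc_norm
        for t in terms
    )

def folded_boosts(terms, global_idf, boost):
    # Everything that does not depend on the doc, folded into one boost.
    qn = query_norm(terms, global_idf, boost)
    return {t: qn * boost[t] * global_idf[t] ** 2 for t in terms}

def score_folded(terms, folded, tf, doc_norm, coord=1.0):
    # Sub-searcher side: only tf and docNorm remain, local idf is ignored.
    return coord * sum(folded[t] * tf[t] * doc_norm for t in terms)

terms = ["multi", "searcher"]
global_idf = {"multi": 2.0, "searcher": 3.0}
boost = {"multi": 1.0, "searcher": 1.5}
tf = {"multi": 1.0, "searcher": 2.0}

direct = score_direct(terms, global_idf, boost, tf, doc_norm=0.5)
folded = score_folded(terms, folded_boosts(terms, global_idf, boost), tf, doc_norm=0.5)
print(direct, folded)
```

Both paths give the same score for these inputs, since the folded boost
is just the doc-independent part of each summand.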

However, there may be one problem with this approach.  It loses
information that might be necessary for a proposal of mine, which is to
fix Lucene's normalization (again discussed ad nauseam on an earlier
thread).  I'm not sure whether that algorithm could be done in concert
with the boost-based MultiSearcher rewriting approach (and am also not
sure it couldn't).

Re. idf^2, it's the squaring in the numerator that I think is bogus:

  > I remember this discussion. I also took part a little bit :-)
  > You may be right. But I am not completely convinced. I think
  > this should be decided based on the proposed benchmark evaluation.

Is that still happening?

Chuck

