Christoph Goller writes:
> Chuck Williams wrote:
> > Christoph Goller writes:
> > > My intention was to (ab-)use query boosts for idf transmission and
> > > to overwrite Similarity so that local idf is ignored. The idea was
> > > to simply multiply global idf into the given boost. Unfortunately
> > > idf is not only used with the boosts and query normalization. It
> > > also occurs in the document part of the scoring algorithm. If you
> > > look into TermWeight.normalize(float queryNorm) there is an
> > > additional multiplication with idf. The same holds for
> > > PhraseWeight. So my idea probably does not work :-(
> >
> > This is not a problem for at least one reason, and I argue two
> > reasons:
> > 1. The idf factor in the document part of the scoring algorithm is
> > precisely the same quantity as that in the query part. I.e., for
> > every term in the query, idf^2 is multiplied into the score.
> > Rewriting this factor into the boost associated with the term
> > instead of the weight can be made consistent with current scoring
> > by simply squaring idf.
>
> Unfortunately not. One idf together with the query boost is used for
> normalization based on the query norm. This is the idf belonging to
> the query. The other idf belongs to the document vector and therefore
> does not go into the normalization.

Actually, the normalization is a third idf factor (in a different form:
summed and square-rooted in the denominator). I.e., for a simple
BooleanQuery:

  score(query, doc) = coord * queryNorm *
      sum[ term in query :
           idf(term) * boost(term) * idf(term) * tf(term, doc) * docNorm(doc) ]

where

  queryNorm = 1 / sqrt( sum[ term in query : (boost(term) * idf(term))^2 ] )

So only the Scorer terms tf(term, doc) and docNorm(doc) depend on the
doc. The rest of the computation depends only on the boosts and idf's,
and so can be computed by a MultiSearcher augmented with a global idf
table. To be explicit, the queryNorm could also be factored into the
boost if that implementation is desired: the MultiSearcher boost could
be every factor in the formula above except tf(term, doc) * docNorm(doc).

However, there may be one problem with this approach. It loses
information that might be necessary for a proposal of mine, which is to
fix Lucene's normalization (again, discussed ad nauseam on an earlier
thread). I'm not sure whether that algorithm could be done in concert
with the boost-based MultiSearcher rewriting approach (and am also not
sure it couldn't).

Re. idf^2, it's the squaring in the numerator that I think is bogus:

> I remember this discussion. I also took part a little bit :-)
> You may be right. But I am not completely convinced. I think
> this should be decided based on the proposed benchmark evaluation.

Is that still happening?

Chuck
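
To make the factoring argument concrete, here is a minimal sketch in plain Java. It is not Lucene code: the class GlobalIdfScoringSketch and all of its methods are hypothetical names made up for illustration, it uses double where Lucene uses float, and it takes queryNorm to be 1/sqrt of the summed squares, as the "square-rooted in the denominator" remark suggests. It computes the score once directly from the formula and once with queryNorm and idf^2 folded into a per-term boost, so that only tf and docNorm remain on the document side.

```java
/** Illustrative sketch only; not Lucene code. All names are hypothetical. */
final class GlobalIdfScoringSketch {

    /** queryNorm = 1 / sqrt( sum over terms of (boost * idf)^2 ). */
    static double queryNorm(double[] boosts, double[] idfs) {
        double sumOfSquares = 0.0;
        for (int i = 0; i < boosts.length; i++) {
            double w = boosts[i] * idfs[i];
            sumOfSquares += w * w;
        }
        return 1.0 / Math.sqrt(sumOfSquares);
    }

    /** Folds queryNorm and idf^2 into each term's boost (doc-independent part). */
    static double[] foldedBoosts(double[] boosts, double[] idfs) {
        double norm = queryNorm(boosts, idfs);
        double[] folded = new double[boosts.length];
        for (int i = 0; i < boosts.length; i++) {
            folded[i] = norm * boosts[i] * idfs[i] * idfs[i];
        }
        return folded;
    }

    /** Score computed term by term, exactly as the formula is written. */
    static double scoreDirect(double coord, double[] boosts, double[] idfs,
                              double[] tfs, double docNorm) {
        double sum = 0.0;
        for (int i = 0; i < boosts.length; i++) {
            sum += idfs[i] * boosts[i] * idfs[i] * tfs[i] * docNorm;
        }
        return coord * queryNorm(boosts, idfs) * sum;
    }

    /** Score using the folded boosts; only tf and docNorm come from the doc. */
    static double scoreFolded(double coord, double[] folded,
                              double[] tfs, double docNorm) {
        double sum = 0.0;
        for (int i = 0; i < folded.length; i++) {
            sum += folded[i] * tfs[i] * docNorm;
        }
        return coord * sum;
    }

    public static void main(String[] args) {
        double[] boosts = {1.0, 2.0};   // made-up query boosts
        double[] idfs = {3.0, 0.5};     // made-up global idf values
        double[] tfs = {2.0, 4.0};      // made-up per-document tf values
        double docNorm = 0.5;
        double coord = 1.0;
        System.out.println(scoreDirect(coord, boosts, idfs, tfs, docNorm));
        System.out.println(
            scoreFolded(coord, foldedBoosts(boosts, idfs), tfs, docNorm));
    }
}
```

The two methods should agree up to floating-point rounding, which is the point of the argument: given global idf values, everything except tf(term, doc) * docNorm(doc) can be computed once per query. coord is left as a separate parameter here, as in the formula.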