RE: How to proceed with Bug 31841 - MultiSearcher problems with Similarity.docFreq() ?

Chuck Williams Thu, 27 Jan 2005 09:20:18 -0800

Christoph Goller writes:
  > My intention was to (ab-)use query boosts for idf transmission and to
  > overwrite Similarity so that local idf is ignored. The idea was to
  > simply multiply global idf into the given boost. Unfortunately idf is
  > not only used with the boosts and query normalization. It also occurs
  > in the document part of the scoring algorithm. If you look into
  > TermWeight.normalize(float queryNorm) there is an additional
  > multiplication with idf. The same holds for PhraseWeight. So my idea
  > probably does not work :-(

This is not a problem for at least one reason, and I would argue two:
  1.  The idf factor in the document part of the scoring algorithm is precisely
the same quantity as that in the query part.  I.e., for every term in the
query, idf^2 is multiplied into the score.  Moving this factor into the
boost associated with the term instead of the weight can be made consistent
with current scoring by simply squaring idf.
  2.  Squaring idf is almost certainly a bad idea anyway.  There was a separate
thread on this topic a while back.  In short, although Salton squared idf in
his original vector space formula, even he later dropped one idf term on
empirical grounds; the best scoring algorithms today do not square idf; and the
recent theoretical justifications for tf*idf demonstrate why idf should not be
squared.  These points are elaborated fully in the earlier thread.
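
To illustrate point 1, here is a minimal sketch in plain Java (no Lucene
classes).  The class and method names are made up for illustration, and the
model is deliberately simplified: it covers only the per-term factor
tf * (idf * boost) * idf and leaves out queryNorm, field norms and coord.
queryNorm would come out differently under the two schemes, but as far as I
can tell only by a factor that is constant for a given query.

```java
public class IdfInBoostSketch {

    // Simplified per-term contribution as scoring works today:
    // the query part carries idf * boost, and the document part
    // multiplies idf in a second time.
    static double classicTermScore(double tf, double idf, double boost) {
        double queryWeight = idf * boost;   // query part
        return tf * queryWeight * idf;      // document part
    }

    // Same contribution when the Similarity's idf is forced to 1 and the
    // global idf is folded, squared, into the term boost instead.
    static double boostOnlyTermScore(double tf, double globalIdf, double boost) {
        double localIdf = 1.0;                               // idf() neutralized
        double foldedBoost = boost * globalIdf * globalIdf;  // idf^2 moved into boost
        return classicTermScore(tf, localIdf, foldedBoost);
    }

    public static void main(String[] args) {
        double tf = 2.0, idf = 3.5, boost = 1.5;   // arbitrary sample values
        System.out.println("classic:    " + classicTermScore(tf, idf, boost));
        System.out.println("boost-only: " + boostOnlyTermScore(tf, idf, boost));
    }
}
```

Both calls should produce the same value, which is all the equivalence in
point 1 claims.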

Wolf Siberski writes:
  > Wolf Siberski wrote:
  > > This is more or less how the patch I already submitted works
  > > (except that it ignored the query rewriting step). The problem I see
  > > with this now is that if I (ab-)use the Similarity class for idf
  > > transmission, it can't be redefined anymore by a user who wants to use
  > > a custom Similarity measure.

I think this could be addressed.  Whether it puts idf into the weight or into a 
term boost, MultiSearcher should use the Similarity to compute idf from 
docFreq.  This Similarity in MultiSearcher would remain specializable by the 
application as today.  One way to achieve this would be to add a new class
MultiSearcherDefaultSimilarity that adds a new method globalIdf() and 
specializes idf() to always return 1.  MultiSearcher based apps could then 
specialize globalIdf() to get the same capability as single Searcher based apps 
achieve by specializing idf().
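
A rough sketch of that shape, written against a stand-in base class so it
compiles without Lucene.  StandInSimilarity is hypothetical, its idf formula
is only modeled on the default one, and the globalIdf(int, int) signature is
my assumption; DampenedIdfSimilarity is just an arbitrary example of an
application override.

```java
public class GlobalIdfSketch {

    // Stand-in for the default Similarity so the sketch is self-contained.
    // Only the idf hook is modeled: log(numDocs / (docFreq + 1)) + 1.
    static class StandInSimilarity {
        public float idf(int docFreq, int numDocs) {
            return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
        }
    }

    static class MultiSearcherDefaultSimilarity extends StandInSimilarity {
        // Hypothetical hook: idf computed from collection-wide statistics.
        public float globalIdf(int globalDocFreq, int globalNumDocs) {
            return super.idf(globalDocFreq, globalNumDocs);
        }

        // Local idf is neutralized so per-index statistics never reach the score.
        @Override
        public float idf(int docFreq, int numDocs) {
            return 1.0f;
        }
    }

    // Example of application-level tuning: dampen idf with a square root.
    static class DampenedIdfSimilarity extends MultiSearcherDefaultSimilarity {
        @Override
        public float globalIdf(int globalDocFreq, int globalNumDocs) {
            return (float) Math.sqrt(super.globalIdf(globalDocFreq, globalNumDocs));
        }
    }

    public static void main(String[] args) {
        MultiSearcherDefaultSimilarity sim = new DampenedIdfSimilarity();
        System.out.println("local idf:  " + sim.idf(5, 100));
        System.out.println("global idf: " + sim.globalIdf(50, 1000));
    }
}
```

The application overrides globalIdf() exactly where it would override idf()
today, and MultiSearcher would be the only caller of globalIdf().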

  > > Probably the discussion is moot anyway, because I guess
  > > that about 99% Lucene-based applications use the default Similarity.

If true, I expect that is because 99% of Lucene applications do no relevance 
tuning.  I have found it impossible to get good relevance ranking without 
customizing the Similarity, and would be forced to drop Lucene (or rewrite the
relevant parts) if this were not possible.

Chuck

  > -----Original Message-----
  > From: Christoph Goller [mailto:[EMAIL PROTECTED]
  > Sent: Thursday, January 27, 2005 3:36 AM
  > To: Lucene Developers List
  > Subject: Re: How to proceed with Bug 31841 - MultiSearcher problems with
  > Similarity.docFreq() ?
  > 
  > Wolf Siberski wrote:
  > > This is more or less how the patch I already submitted works
  > > (except that it ignored the query rewriting step). The problem I see
  > > with this now is that if I (ab-)use the Similarity class for idf
  > > transmission, it can't be redefined anymore by a user who wants to use
  > > a custom Similarity measure.
  > 
  > My intention was to (ab-)use query boosts for idf transmission and to
  > overwrite Similarity so that local idf is ignored. The idea was to
  > simply multiply global idf into the given boost. Unfortunately idf is
  > not only used with the boosts and query normalization. It also occurs
  > in the document part of the scoring algorithm. If you look into
  > TermWeight.normalize(float queryNorm) there is an additional
  > multiplication with idf. The same holds for PhraseWeight. So my idea
  > probably does not work :-(
  > 
  > > But there is still the valid question why the Similarity is owned
  > > by the Searchables and not by the query. For me it seems to be more
  > > logical that the Similarity measure used should be part of the query,
  > > but of course there may be good reasons why this is not the case
  > > currently. Probably the discussion is moot anyway, because I guess
  > > that about 99% Lucene-based applications use the default Similarity.
  > 
  > Searchables currently don't have a Similarity, Searchers do. By
  > overwriting Query.getSimilarity(Searcher) a Query may have its own
  > Similarity. Maybe Query should get a private member variable Similarity
  > that is null by default and getter/setters. If this variable is set
  > explicitly Query.getSimilarity(Searcher) could return Query's Similarity,
  > otherwise the Searcher's similarity. This looks reasonable to me. I
  > think you had that kind of variable already.
  > 
  > Christoph
  > 
  > 
  > ---------------------------------------------------------------------
  > To unsubscribe, e-mail: [EMAIL PROTECTED]
  > For additional commands, e-mail: [EMAIL PROTECTED]
