: For the first change, the logic is that Lucene's default length normalization
: punishes long documents too much. I found contrib's sweet-spot-similarity
: helpful here, but not enough. I found that a better doc-length
: normalization method is one that considers collection statistics - e.g.
: average doc length. The nice problem with such an approach is that you

for the record: SweetSpotSimilarity was designed to let you use the
average length as the "sweet spot" (both min and max) but as you say: you
have to know/guess the average length in advance (and explicitly set it
using SweetSpotSimilarity.setLengthNormFactors(yourAvg,yourAvg,steepness))
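
ie, something like this (a rough sketch; the 500 and 0.5f are placeholder
numbers you'd have to measure/tune for your own collection)...

SweetSpotSimilarity ss = new SweetSpotSimilarity();
// pretend we've measured an average field length of ~500 tokens
ss.setLengthNormFactors(500, 500, 0.5f);
writer.setSimilarity(ss);     // and use the same instance at search time
searcher.setSimilarity(ss);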

: able to boost the title by (say) 3, but in fact, there is no "IR'ish"
: difference between finding the searched text in the title field or in the
: body field - they really serve/answer the same information need. For that
: matter, I believe that using a single document length when searching all
: these fields is more "accurate".

if i understand your argument, the pure "IR'ish" model makes no
distinction between finding the input in different fields, so it should
use a single document length -- but that assumes you are searching for the
input in all fields, right?  if you have a title field and a body field
but you are only searching for "title:word" then why should the length of
the body field factor into the score at all?
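
(as an aside: the per-field lengthNorm hook that already exists covers
simple variations on this theme ... a rough sketch, where the field name
and the "ignore title length" policy are both made up for illustration)

public class TitleFlatSimilarity extends DefaultSimilarity {
  // hypothetical policy: ignore length entirely for the title field,
  // keep the default 1/sqrt(numTerms) for everything else
  public float lengthNorm(String fieldName, int numTerms) {
    if ("title".equals(fieldName)) return 1.0f;
    return super.lengthNorm(fieldName, numTerms);
  }
}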

my basic understanding of IRish relevancy judgements of the idf/tf model
is that "documents" are generic lumps of text ... while Lucene has a
"Document" class that supports multiple Fields, the most direct mapping of
IR "documents" to Lucene objects would be a Document with only one Field
... but that doesn't mean that if a Document has more than one field, the
length of all fields should be a factor in searching for a "field1:word"
(any more than the tf used should be the frequency of "word" in all
fields)


Note: for the record, i think it would be nice if the score computation
"engine" of Lucene allowed for more hooks to let people control stuff like
this, so that it is possible in the use cases where it makes sense -- i'm
just not convinced it makes sense as the general case. (ie: even if it
were possible, i don't think it would make sense as the default)

: that doc. If you agree about the potential improvement here, again, a nice
: problem is that the current Similarity API does not even allow considering
: this info (the average term frequency in the specific document) because
: Similarity.tf(int/float freq) takes only the frequency param. One way to
: open the door for such a computation is to add an "int docid" param to the
: Similarity class, but then the implementation of that class becomes
: IndexReader aware.

this strikes me as not being a deficiency in the existing
Query/Scorer/Similarity APIs so much as an opportunity for new Query types
(with new Scorer types) that take advantage of the Payloads API (i say
this having almost 0 knowledge of what the payloads API looks like) to
take advantage of more term, doc, and index level payloads.  Any of this
type of logic could be put into new Query subclasses; the trick becomes
allowing the common math of those new Query classes to be refactored out
and controlled by classes at runtime -- this is really what the current
Similarity class does: provide a refactoring of common types of
calculations in such a way that those calculations can be replaced at run
time (at least: that's how i view the Similarity class)

i'm just typing out loud here without thinking it through, but one
approach we could take to allow Similarity to be extended in special use
cases would be if each new Query's getSimilarity() method tested the
similarity from its parent to see if it implemented an interface
indicating that it contained the method(s) this query needed to perform
its custom calculations ... if it does: hurray! ... if not, then wrap it
in the "default" version of that marker interface.

ie...

public class DocRelativeTfTermQuery extends TermQuery {
  ...
  public Similarity getSimilarity(Searcher searcher) {
    Similarity s = super.getSimilarity(searcher);
    if (s instanceof DocRelativeTfSimilarity) return s;
    return new AverageTfSimilarityWrapper(s);
  }
  public static interface DocRelativeTfSimilarity {
    public float tf(int doc, float freq);
  }
  public static class AverageTfSimilarityWrapper
                extends SimilarityDelegator
                implements DocRelativeTfSimilarity {
    public AverageTfSimilarityWrapper(Similarity s) { super(s); }
    public float tf(int doc, float freq) {
      // shift the raw freq by the per-document average, then delegate
      // to the wrapped Similarity's normal tf()
      return tf(freq - averageFreq(doc));
    }
    protected float averageFreq(int doc) {
      // ... get from document payload ...
    }
  }
}


...so now by default the DocRelativeTfTermQuery computes the score for
that term based on the average tf of all terms in that field ... but if
you construct a Similarity of your own that implements
DocRelativeTfSimilarity you can make it use any document-specific
calculation you want.
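
and the "client" side would look something like this (again: just a
sketch; all of these class names are hypothetical)...

public class MySimilarity extends DefaultSimilarity
    implements DocRelativeTfTermQuery.DocRelativeTfSimilarity {
  public float tf(int doc, float freq) {
    // whatever document-specific computation you want goes here
    return super.tf(freq);
  }
}

// at search time...
searcher.setSimilarity(new MySimilarity());
Hits hits = searcher.search(
    new DocRelativeTfTermQuery(new Term("body", "word")));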

The performance of simple queries doesn't change, the API for simple users
of Similarity doesn't get more complicated, and we can start supporting
arbitrarily complex scoring calculations in Scorers and still allow the
"client" to supply the meat of those calculations at run time.



-Hoss

