On Feb 12, 2008, at 5:04 PM, Grant Ingersoll wrote:
> I don't know a lot about it, but my understanding has always been
> that comparing across queries is difficult at best, so that would
> argue for removing it, but I haven't done any research into it. I
> think it has been in Lucene for a good long time, so it may be that
> the history of why it is in there is forgotten.
It's called once per Query during Query.weight(Searcher):
  /** Expert: Constructs and initializes a Weight for a top-level query. */
  public Weight weight(Searcher searcher) throws IOException {
    Query query = searcher.rewrite(this);
    Weight weight = query.createWeight(searcher);
    float sum = weight.sumOfSquaredWeights();
    float norm = getSimilarity(searcher).queryNorm(sum); // <------- HERE
    weight.normalize(norm);
    return weight;
  }
It looks like Lucene actually *does* propagate the normalized sum-of-
squared-weights into all sub-queries. That call to
weight.normalize(norm) right before the end uses the value generated
by queryNorm(); BooleanWeight.normalize() (for example) propagates the
modified value:
  public void normalize(float norm) {
    norm *= getBoost(); // incorporate boost
    for (int i = 0; i < weights.size(); i++) {
      Weight w = (Weight) weights.elementAt(i);
      // normalize all clauses (even if prohibited, in case of side effects)
      w.normalize(norm);
    }
  }
It's the *same* coefficient for all sub-clauses, so it shouldn't
affect rankings, BUT... relative rankings *will* be affected if some
inner clauses have custom boost values.
It seems to me, conceptually, like code that claims to perform
"normalization" shouldn't be able to affect rankings. However,
because of this side effect of incorporating boost at the
normalization stage, it can.
I think.
This code is really hard to follow. :(
> Also, do you have a sense of its cost in terms of performance?
Nil.
It's only called once per Query and all it does by default is damp the
weighting coefficient:
multiplier = 1 / sqrt(multiplier)
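If I have that right, the default behavior boils down to something
like this standalone sketch (class and method names here are for
illustration, not Lucene's API):

```java
public class QueryNormSketch {
    // Damp the weighting coefficient by 1 / sqrt(sumOfSquaredWeights),
    // as the default queryNorm described above does.
    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    public static void main(String[] args) {
        // A complex query with a large sum of squared weights gets
        // damped more aggressively than a simple one.
        System.out.println(queryNorm(4.0f));   // 0.5
        System.out.println(queryNorm(100.0f)); // 0.1
    }
}
```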
If I reckon right, zapping it means that e.g. complex BooleanWeight
objects which return a high value for sumOfSquaredWeights() will
produce scores which are high, maybe startlingly high to some users.
My guess is that the default implementation was chosen to complement
the sum-of-squared-weights algo.
I'm not sure I care whether the scoring range expands. Normalizing
scores manually is cake, if people want to do that.
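For instance, a user who wants a bounded range can just scale every
hit's score by the top score so the best hit lands at 1.0 — a hedged
sketch, with plain arrays standing in for Lucene's hit objects:

```java
public class ManualNorm {
    // Divide each score by the maximum, so scores fall in [0, 1].
    static float[] normalizeByMax(float[] scores) {
        float max = 0f;
        for (float s : scores) if (s > max) max = s;
        if (max == 0f) return scores.clone(); // nothing to scale
        float[] out = new float[scores.length];
        for (int i = 0; i < scores.length; i++) out[i] = scores[i] / max;
        return out;
    }

    public static void main(String[] args) {
        float[] norm = normalizeByMax(new float[] {8f, 4f, 2f});
        System.out.println(java.util.Arrays.toString(norm)); // [1.0, 0.5, 0.25]
    }
}
```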
Heck, I'd love to eliminate ALL the automatic normalization code... if
only I could figure out what all the hidden side effects are. :(
My goal is to de-voodoofy the Query-Weight-Scorer compilation phase so
that it's easier to write Query subclasses, and I'm happy to sacrifice
consistency of scoring range if it'll help simplify things.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/