On Feb 12, 2008, at 5:04 PM, Grant Ingersoll wrote:

> I don't know a lot about it, but my understanding has always been that comparing across queries is difficult at best, so that would argue for removing it, but I haven't done any research into it. I think it has been in Lucene for a good long time, so it may be that the history of why it is in there is forgotten.

It's called once per Query during Query.weight(Searcher):

  /** Expert: Constructs and initializes a Weight for a top-level query. */
  public Weight weight(Searcher searcher)
    throws IOException {
    Query query = searcher.rewrite(this);
    Weight weight = query.createWeight(searcher);
    float sum = weight.sumOfSquaredWeights();
    float norm = getSimilarity(searcher).queryNorm(sum); // <------- HERE
    weight.normalize(norm);
    return weight;
  }

It looks like Lucene actually *does* propagate the normalized sum-of-squared-weights into all sub-queries. That call to weight.normalize(norm) right before the end uses the value generated by queryNorm(); BooleanWeight.normalize() (for example) propagates the modified value:

    public void normalize(float norm) {
      norm *= getBoost();                         // incorporate boost
      for (int i = 0; i < weights.size(); i++) {
        Weight w = (Weight)weights.elementAt(i);
        // normalize all clauses (even if prohibited, in case of side effects)
        w.normalize(norm);
      }
    }

It's the *same* coefficient for all sub-clauses, so by itself it shouldn't affect rankings: multiplying every clause's weight by the same positive constant preserves document ordering. BUT... relative rankings *will* be affected if some inner clauses have custom boost values.
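
To make that concrete, here's a trivial sketch (numbers invented) of why one shared positive coefficient can't reorder hits, while a per-clause boost can:

    // Hypothetical raw scores for two documents matching the same query:
    float scoreA = 4.0f, scoreB = 2.5f;

    // queryNorm() multiplies both by the same positive constant, so the
    // ordering (A ahead of B) is untouched:
    float norm = 0.3f;
    System.out.println(norm * scoreA > norm * scoreB);           // true

    // But normalize() also folds in getBoost(); a boost applied to only
    // one clause scales the scores differently and can flip the order:
    float boostB = 2.0f;
    System.out.println(norm * scoreA > norm * boostB * scoreB);  // false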

Conceptually, it seems to me that code which claims to perform "normalization" shouldn't be able to affect rankings at all. However, because boost gets incorporated at the normalization stage as a side effect, it can.

I think.

This code is really hard to follow. :(

> Also, do you have a sense of its cost in terms of performance?

Nil.

It's only called once per Query and all it does by default is damp the weighting coefficient:

  norm = 1 / sqrt(sumOfSquaredWeights)
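
For reference, DefaultSimilarity's implementation is just this one-liner (quoting from memory, so double-check the source):

  public float queryNorm(float sumOfSquaredWeights) {
    return (float)(1.0 / Math.sqrt(sumOfSquaredWeights));
  }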

If I reckon right, zapping it means that, e.g., complex BooleanWeight objects which return a high value from sumOfSquaredWeights() will produce high scores, maybe startlingly high to some users. My guess is that the default implementation was chosen to complement the sum-of-squared-weights algo.
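
If anyone wants to preview the effect without patching Lucene, overriding queryNorm() to a no-op in a custom Similarity should simulate removal; a sketch (class name is mine, untested):

  public class NoQueryNormSimilarity extends DefaultSimilarity {
    public float queryNorm(float sumOfSquaredWeights) {
      return 1.0f;  // no damping: scores grow with query complexity
    }
  }

  // Install it before searching:
  searcher.setSimilarity(new NoQueryNormSimilarity());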

I'm not sure I care whether the scoring range expands. Normalizing scores manually is cake, if people want to do that.
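
E.g., a sketch along these lines (untested), dividing everything by the top score so the best hit always lands at 1.0:

  TopDocs topDocs = searcher.search(query, null, 10);
  ScoreDoc[] hits = topDocs.scoreDocs;
  if (hits.length > 0) {
    float max = hits[0].score;  // hits arrive sorted by descending score
    for (int i = 0; i < hits.length; i++) {
      hits[i].score = hits[i].score / max;  // now in (0.0, 1.0]
    }
  }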

Heck, I'd love to eliminate ALL the automatic normalization code... if only I could figure out what all the hidden side effects are. :(

My goal is to de-voodoofy the Query-Weight-Scorer compilation phase so that it's easier to write Query subclasses, and I'm happy to sacrifice consistency of scoring range if it'll help simplify things.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

