On Feb 12, 2008, at 5:04 PM, Grant Ingersoll wrote:
> I don't know a lot about it, but my understanding has always been
> that comparing across queries is difficult at best, so that would
> argue for removing it, but I haven't done any research into it. I
> think it has been in Lucene for a good long time, so it may be that
> the history of why it is in there is forgotten.
It's called once per Query during Query.weight(Searcher):
  /** Expert: Constructs and initializes a Weight for a top-level query. */
  public Weight weight(Searcher searcher) throws IOException {
    Query query = searcher.rewrite(this);
    Weight weight = query.createWeight(searcher);
    float sum = weight.sumOfSquaredWeights();
    float norm = getSimilarity(searcher).queryNorm(sum); // <------- HERE
    weight.normalize(norm);
    return weight;
  }
It looks like Lucene actually *does* propagate the normalized sum-of-
squared-weights into all sub-queries. That call to
weight.normalize(norm) right before the end uses the value generated
by queryNorm(); BooleanWeight.normalize() (for example) propagates the
modified value:
  public void normalize(float norm) {
    norm *= getBoost(); // incorporate boost
    for (int i = 0; i < weights.size(); i++) {
      Weight w = (Weight) weights.elementAt(i);
      // normalize all clauses (even if prohibited, in case of side effects)
      w.normalize(norm);
    }
  }
It's the *same* coefficient for all sub-clauses, so it shouldn't
affect rankings, BUT... relative rankings *will* be affected if some
inner clauses have custom boost values.
It seems to me, conceptually, like code that claims to perform
"normalization" shouldn't be able to affect rankings. However,
because of this side effect of incorporating boost at the
normalization stage, it can.
I think.
This code is really hard to follow. :(
> Also, do you have a sense of its cost in terms of performance?
Nil.
It's only called once per Query and all it does by default is damp the
weighting coefficient:
multiplier = 1 / sqrt(multiplier)
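If I have that right, the default behavior boils down to something
like this standalone sketch (class and method names here are for
illustration, not Lucene's API):

```java
public class QueryNormSketch {
    // Damp the weighting coefficient by 1 / sqrt(sumOfSquaredWeights),
    // as the default queryNorm described above does.
    static float queryNorm(float sumOfSquaredWeights) {
        return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
    }

    public static void main(String[] args) {
        // A complex query with a large sum of squared weights gets
        // damped more aggressively than a simple one.
        System.out.println(queryNorm(4.0f));   // 0.5
        System.out.println(queryNorm(100.0f)); // 0.1
    }
}
```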
If I reckon right, zapping it means that e.g. complex BooleanWeight
objects which return a high value for sumOfSquaredWeights() will
produce scores which are high, maybe startlingly high to some users.
My guess is that the default implementation was chosen to complement
the sum-of-squared-weights algo.
I'm not sure I care whether the scoring range expands. Normalizing
scores manually is cake, if people want to do that.
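For instance, a user who wants a bounded range can just scale every
hit's score by the top score so the best hit lands at 1.0 — a hedged
sketch, with plain arrays standing in for Lucene's hit objects:

```java
public class ManualNorm {
    // Divide each score by the maximum, so scores fall in [0, 1].
    static float[] normalizeByMax(float[] scores) {
        float max = 0f;
        for (float s : scores) if (s > max) max = s;
        if (max == 0f) return scores.clone(); // nothing to scale
        float[] out = new float[scores.length];
        for (int i = 0; i < scores.length; i++) out[i] = scores[i] / max;
        return out;
    }

    public static void main(String[] args) {
        float[] norm = normalizeByMax(new float[] {8f, 4f, 2f});
        System.out.println(java.util.Arrays.toString(norm)); // [1.0, 0.5, 0.25]
    }
}
```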
Heck, I'd love to eliminate ALL the automatic normalization code... if
only I could figure out what all the hidden side effects are. :(
My goal is to de-voodoofy the Query-Weight-Scorer compilation phase so
that it's easier to write Query subclasses, and I'm happy to sacrifice
consistency of scoring range if it'll help simplify things.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/