On Jul 7, 2005, at 1:39 PM, Mark Bennett wrote:
> Our client, Rojo, is considering overriding the default implementation
> of lengthNorm to fix the bias towards extremely short RSS documents.
Different normalization schemes are given a thorough examination in
this 1997 paper:
http://www.cs.ust.hk/faculty/dlee/Papers/ir/ieee-sw-rank.pdf
Here is what they have to say about the ideal case, "full
normalization":
[begin excerpt]
... a document containing {x, y, z}
will have exactly the same score as
another document containing {x, x, y,
y, z, z} because these two document
vectors have the same unit vector. We
can debate whether this is reasonable
or not, but when document lengths
vary greatly, it makes sense to take
them into account.
[end excerpt]
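
To make the excerpt concrete, here is a minimal sketch of full normalization in Python. It assumes raw term counts as vector weights (no idf), which may differ from the weighting used in the paper; the helper names and the two-term query are made up for illustration.

```python
import math
from collections import Counter

def unit_vector(terms):
    """Return the term-count vector of `terms`, scaled to length 1."""
    counts = Counter(terms)
    length = math.sqrt(sum(c * c for c in counts.values()))
    return {term: c / length for term, c in counts.items()}

def score(query_terms, doc_terms):
    """Dot product of the two unit vectors, i.e. cosine similarity."""
    q = unit_vector(query_terms)
    d = unit_vector(doc_terms)
    return sum(weight * d.get(term, 0.0) for term, weight in q.items())

short_doc = ["x", "y", "z"]
long_doc = ["x", "x", "y", "y", "z", "z"]
query = ["x", "y"]  # hypothetical query, not from the paper

short_score = score(query, short_doc)
long_score = score(query, long_doc)

# Both documents reduce to the same unit vector, so the scores match
# (up to floating-point rounding).
print(short_score, long_score)
```

Doubling every term count doubles the vector's length as well, so the division cancels it out. That is why the two documents are indistinguishable under this scheme.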
Their experimental results indicate that the Lucene default,
1/sqrt(num_terms), is quite effective. The effect of the various
normalization schemes on precision is specific to the characteristics
of the document collection, though, and extremely short RSS documents
would seem to be an outlying case. Anything short of (prohibitively
expensive) full normalization requires a bias towards one length of
document. If you assign maximum weight to the 50-term documents,
you've probably penalized dictionary definitions. FWIW (this is my
second Lucene post, and I'm not involved with the project), I would
lean towards the clip method as a default, but it's certainly
justifiable to tweak a normalization scheme to suit your needs.
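
As a rough illustration of the bias, the sketch below compares 1/sqrt(num_terms) with a hypothetical variant that treats every document shorter than some floor as if it were exactly that long. The floor of 50 terms and the function names are assumptions for the example; this is not necessarily the paper's clip method, nor Mark's formula.

```python
import math

def default_length_norm(num_terms):
    """The 1/sqrt(num_terms) norm discussed above."""
    return 1.0 / math.sqrt(num_terms)

def floored_length_norm(num_terms, min_terms=50):
    """Hypothetical variant: anything shorter than min_terms
    gets the same norm as a min_terms-long document."""
    return 1.0 / math.sqrt(max(num_terms, min_terms))

# Compare the two norms across a range of document lengths.
for n in (4, 50, 400, 10000):
    print(n, round(default_length_norm(n), 4), round(floored_length_norm(n), 4))
```

With the default, a 4-term document gets a norm of 0.5 against roughly 0.14 for a 50-term one. The floored variant gives both the same norm, at the cost of no longer telling very short documents apart.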
> The "flat" and "stretch" factors are specific to my formula. I've tried
> playing around with how gradual the curve slopes away for smaller and
> larger documents; for example, the red curve really "punishes"
> documents with less than 5 words.
Please correct me if I'm wrong, but isn't num_terms in Lucene's
1/sqrt(num_terms) the number of terms in the field, rather than the
number of terms in the document? If that's true, then how would
adopting a different curve as default affect the relative weight of a
"title" field?
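
To show what I mean, here is a small sketch that assumes the norm is computed per field, as the question supposes. The document contents and field lengths are invented for the example.

```python
import math

def length_norm(num_terms):
    """1/sqrt(num_terms), applied to whatever unit is being counted."""
    return 1.0 / math.sqrt(num_terms)

# Hypothetical document: a 3-term title and a 400-term body.
doc = {
    "title": "fixing rss bias".split(),
    "body": ["word"] * 400,
}

# Per-field norms: each field is normalized by its own length.
field_norms = {field: length_norm(len(terms)) for field, terms in doc.items()}

# Per-document norm: one value based on the total term count.
whole_doc_norm = length_norm(sum(len(terms) for terms in doc.values()))

print(field_norms, whole_doc_norm)
```

If the norm is per field, the short title gets a much larger norm than the body. Any curve that penalizes very short inputs would then hit title fields hardest.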
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/