On Jul 7, 2005, at 1:39 PM, Mark Bennett wrote:
Our client, Rojo, is considering overriding the default implementation of
lengthNorm to fix the bias towards extremely short RSS documents.

Different normalization schemes are given a thorough examination in this 1997 paper:

http://www.cs.ust.hk/faculty/dlee/Papers/ir/ieee-sw-rank.pdf

Here is what they have to say about the ideal case, "full normalization":

[begin excerpt]

... a document containing {x, y, z}
will have exactly the same score as
another document containing {x, x, y,
y, z, z} because these two document
vectors have the same unit vector. We
can debate whether this is reasonable
or not, but when document lengths
vary greatly, it makes sense to take
them into account.

[end excerpt]
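
The excerpt's claim is easy to check with a minimal sketch. It assumes raw term counts as the vector weights (no idf), which is a simplification made here and not something the paper or Lucene prescribes; `unit_vector` is a hypothetical helper name.

```python
import math
from collections import Counter

def unit_vector(terms):
    """Scale a raw term-count vector to length 1 (assumes plain counts as weights)."""
    counts = Counter(terms)
    length = math.sqrt(sum(c * c for c in counts.values()))
    return {term: c / length for term, c in counts.items()}

short_doc = unit_vector(["x", "y", "z"])
long_doc = unit_vector(["x", "x", "y", "y", "z", "z"])

# Compare the two unit vectors component by component.
same = all(math.isclose(short_doc[t], long_doc[t]) for t in short_doc)
print(same)
```

Doubling every count doubles the vector's length but leaves its direction unchanged, which is why the two documents end up with the same unit vector.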

Their experimental results indicate that the Lucene default -- 1/sqrt(num_terms) -- is quite effective. The effect of the various normalization schemes on precision depends on the characteristics of the document collection, though, and extremely short RSS documents would seem to be an outlying case.

Anything short of (prohibitively expensive) full normalization requires a bias towards one length of document. If you assign maximum weight to the 50-term documents, you've probably penalized dictionary definitions. FWIW (this is my second Lucene post, and I'm not involved with the project), I would lean towards the clip method as a default, but it's certainly justifiable to tweak a normalization scheme to suit your needs.
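
To make the bias of the 1/sqrt(num_terms) default concrete, here is a small sketch that only evaluates the formula as written above. `default_length_norm` is a hypothetical name, and nothing here calls Lucene itself.

```python
import math

def default_length_norm(num_terms):
    """Evaluate 1/sqrt(num_terms), the default formula quoted above."""
    return 1.0 / math.sqrt(num_terms)

# Print the norm for a range of lengths to see how quickly it falls off.
for n in (1, 4, 25, 100, 10000):
    print(n, default_length_norm(n))

# Ratio between the weight of a 4-term document and a 100-term document.
ratio = default_length_norm(4) / default_length_norm(100)
```

Under this formula a 4-term document gets five times the length weight of a 100-term document, which is the kind of advantage very short RSS items would enjoy.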

The "flat" and "stretch" factors are specific to my formula. I've tried playing around with how gradually the curve slopes away for smaller and larger documents; for example, the red curve really "punishes" documents with fewer than 5 words.

Please correct me if I'm wrong, but isn't num_terms in Lucene's 1/sqrt(num_terms) the number of terms in the field, rather than the number of terms in the document? If that's true, then how would adopting a different curve as the default affect the relative weight of a "title" field?
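
If the count is indeed per field, as the question supposes, one way to keep a "title" field unaffected is to apply the alternative curve only to selected fields. The sketch below is purely illustrative: the function name, the `custom_fields` parameter, and the floor of 50 terms are all assumptions, and the floor is a stand-in curve, not the formula discussed in this thread.

```python
import math

def length_norm(field_name, num_terms, custom_fields=("body",), floor=50):
    """Hypothetical field-aware norm: an alternative curve only for listed fields."""
    if field_name in custom_fields:
        # Assumed stand-in curve: treat any field shorter than `floor` terms
        # as if it had `floor` terms, so very short bodies gain no extra weight.
        return 1.0 / math.sqrt(max(num_terms, floor))
    # All other fields (e.g. "title") keep the 1/sqrt(num_terms) default.
    return 1.0 / math.sqrt(num_terms)

title_norm = length_norm("title", 4)
short_body_norm = length_norm("body", 4)
long_body_norm = length_norm("body", 400)
```

With these assumptions a 4-term title still gets a norm of 0.5, while a 4-term body is capped at 1/sqrt(50), and bodies longer than the floor are scored exactly as before.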

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

