Hello Marvin, Thanks for the reply.
Scanning their paper very quickly, I didn't see a specific mention (though I might have missed it) of extremely short documents (< 5 words). Was there something specific about 1 and 2 word documents you had in mind?

Good point on which field. I was thinking of the "main" field, the body of the message. Certainly titles would be expected to be shorter.

Mark

-----Original Message-----
From: Marvin Humphrey [mailto:[EMAIL PROTECTED]
Sent: Thursday, July 07, 2005 2:39 PM
To: [email protected]
Cc: Mark Bennett
Subject: Re: Proposal for change to DefaultSimilarity's lengthNorm to fix "short document" problem

On Jul 7, 2005, at 1:39 PM, Mark Bennett wrote:

> Our client, Rojo, is considering overriding the default implementation of
> lengthNorm to fix the bias towards extremely short RSS documents.

Different normalization schemes are given a thorough examination in this 1997 paper:

http://www.cs.ust.hk/faculty/dlee/Papers/ir/ieee-sw-rank.pdf

Here is what they have to say about the ideal case, "full normalization":

[begin excerpt]
... a document containing {x, y, z} will have exactly the same score as another document containing {x, x, y, y, z, z} because these two document vectors have the same unit vector. We can debate whether this is reasonable or not, but when document lengths vary greatly, it makes sense to take them into account.
[end excerpt]

Their experimental results indicate that the Lucene default -- 1/sqrt(num_terms) -- is quite effective. The effect upon precision of the various normalization schemes is specific to the characteristics of the document collection, though. Extremely short RSS documents would seem to be an outlying case.

Anything short of (prohibitively expensive) full normalization requires a bias towards one length of document. If you assign maximum weight to the 50-term documents, you've probably penalized dictionary definitions.
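As an illustration of the excerpt's point, here is a minimal Python sketch (not Lucene code) showing that {x, y, z} and {x, x, y, y, z, z} reduce to the same unit vector, while a 1/sqrt(num_terms) factor treats the two lengths differently. The function names `unit_vector` and `default_length_norm` are made up for this example, and raw term counts stand in for whatever term weights a real engine would use.

```python
import math
from collections import Counter


def unit_vector(terms):
    """Scale a raw term-count vector to Euclidean length 1 ("full normalization")."""
    counts = Counter(terms)
    length = math.sqrt(sum(c * c for c in counts.values()))
    return {term: c / length for term, c in counts.items()}


def default_length_norm(num_terms):
    """The 1/sqrt(num_terms) factor discussed in this thread (hypothetical helper)."""
    return 1.0 / math.sqrt(num_terms)


short_doc = ["x", "y", "z"]
long_doc = ["x", "x", "y", "y", "z", "z"]

u_short = unit_vector(short_doc)
u_long = unit_vector(long_doc)

# Both documents point in the same direction, so full normalization
# cannot tell them apart.
same_direction = all(math.isclose(u_short[t], u_long[t]) for t in u_short)
print("same unit vector:", same_direction)

# The length-based factor, by contrast, depends only on the term count.
print("norm for 3 terms:", default_length_norm(len(short_doc)))
print("norm for 6 terms:", default_length_norm(len(long_doc)))
```

Under these assumptions the shorter document receives the larger length factor, which is the bias towards very short documents that the proposal is concerned with.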
FWIW, (this is my second Lucene post -- I'm not involved with the project), I would lean towards the clip method as a default, but it's certainly justifiable to tweak a normalization scheme to suit your needs.

> The "flat" and "stretch" factors are specific to my formula. I've tried
> playing around with how gradual the curve slopes away for smaller and larger
> documents; for example, the red curve really "punishes" documents with less
> than 5 words.

Please correct me if I'm wrong, but isn't num_terms in Lucene's 1/sqrt(num_terms) the number of terms in the field, rather than the number of terms in the document? If that's true, then how would adopting a different curve as default affect the relative weight of a "title" field?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
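To make the per-field question concrete, here is a rough Python sketch, assuming the length factor is computed separately for each field. The curve `short_penalty_norm` is purely hypothetical: it is neither the "flat"/"stretch" formula from the proposal nor the clip method from the paper, only a stand-in for any curve that punishes fields with fewer than 5 terms. The field names and term counts are invented for the example.

```python
import math


def default_length_norm(num_terms):
    """The 1/sqrt(num_terms) factor discussed in this thread (hypothetical helper)."""
    return 1.0 / math.sqrt(num_terms)


def short_penalty_norm(num_terms, min_terms=5):
    """Hypothetical curve: the default factor, scaled down linearly below min_terms."""
    norm = default_length_norm(num_terms)
    if num_terms < min_terms:
        norm *= num_terms / min_terms
    return norm


# Invented example: a 4-term title and a 400-term body in the same document.
field_lengths = {"title": 4, "body": 400}

for field_name, num_terms in field_lengths.items():
    print(
        field_name,
        "default:", default_length_norm(num_terms),
        "with short-field penalty:", short_penalty_norm(num_terms),
    )
```

In this sketch the body field is unaffected, but the title field loses weight simply because titles are naturally short. If such a curve were the default for every field, a per-field exception (for example, keeping the original factor for "title") might be needed.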
