RE: Proposal for change to DefaultSimilarity's lengthNorm to fix "short document" problem

Mark Bennett Thu, 07 Jul 2005 15:16:53 -0700

Hello Marvin,

Thanks for the reply.

Scanning their paper very quickly, I didn't see a specific mention (though I
might have missed it) of extremely short documents (< 5 words).  Was there
something specific about 1 and 2 word documents you had in mind?

Good point on which field.  I was thinking of the "main" field, the body of
the message.  Certainly titles would be expected to be shorter.

Mark

-----Original Message-----
From: Marvin Humphrey [mailto:[EMAIL PROTECTED]]
Sent: Thursday, July 07, 2005 2:39 PM
To: [email protected]
Cc: Mark Bennett
Subject: Re: Proposal for change to DefaultSimilarity's lengthNorm to fix
"short document" problem

On Jul 7, 2005, at 1:39 PM, Mark Bennett wrote:
> Our client, Rojo, is considering overriding the default  
> implementation of
> lengthNorm to fix the bias towards extremely short RSS documents.

Different normalization schemes are given a thorough examination in  
this 1997 paper:

http://www.cs.ust.hk/faculty/dlee/Papers/ir/ieee-sw-rank.pdf

Here is what they have to say about the ideal case, "full  
normalization":

[begin excerpt]

... a document containing {x, y, z}
will have exactly the same score as
another document containing {x, x, y,
y, z, z} because these two document
vectors have the same unit vector. We
can debate whether this is reasonable
or not, but when document lengths
vary greatly, it makes sense to take
them into account.

[end excerpt]
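
To make the excerpt's point concrete, here is a minimal sketch in Python (not Lucene code). It assumes raw term counts as the vector weights, which is a simplification on my part; the paper's weighting may differ. The function name unit_vector is made up for illustration.

```python
import math
from collections import Counter

def unit_vector(terms):
    """Return the document's term-count vector scaled to length 1."""
    counts = Counter(terms)
    # Euclidean length of the raw term-count vector.
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {term: count / norm for term, count in counts.items()}

# The two documents from the excerpt: {x, y, z} and {x, x, y, y, z, z}.
short_doc = unit_vector(["x", "y", "z"])
long_doc = unit_vector(["x", "x", "y", "y", "z", "z"])
```

Both documents come out with the same weight for each of x, y and z, so under full normalization they would score identically against any query.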

Their experimental results indicate that the Lucene default --
1/sqrt(num_terms) -- is quite effective.  The effect upon precision of the  
various normalization schemes is specific to the characteristics of  
the document collection, though.  Extremely short RSS documents would  
seem to be an outlying case.  Anything short of (prohibitively  
expensive) full normalization requires a bias towards one length of  
document.  If you assign maximum weight to the 50-term documents,  
you've probably penalized dictionary definitions.  FWIW (this is my  
second Lucene post -- I'm not involved with the project), I would  
lean towards the clip method as a default, but it's certainly  
justifiable to tweak a normalization scheme to suit your needs.
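
For a rough feel of how the default curve treats different lengths, here is a small Python sketch of the 1/sqrt(num_terms) formula alone. It is not Lucene's implementation, the sample lengths are arbitrary, and it assumes num_terms is at least 1.

```python
import math

def length_norm(num_terms):
    """The 1/sqrt(num_terms) curve discussed above (num_terms >= 1 assumed)."""
    return 1.0 / math.sqrt(num_terms)

# Arbitrary sample lengths, from a one-word document up to a long one.
sample_lengths = [1, 2, 5, 50, 1000]
norms = {n: length_norm(n) for n in sample_lengths}
```

The norm is largest for the one-term document and falls steadily as length grows, which is the bias toward extremely short documents being discussed.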

> The "flat" and "stretch" factors are specific to my formula.  I've  
> tried
> playing around with how gradual the curve slopes away for smaller  
> and larger
> documents; for example, the red curve really "punishes" documents  
> with less
> than 5 words.

Please correct me if I'm wrong, but isn't num_terms in Lucene's
1/sqrt(num_terms) the number of terms in the field, rather than the number  
of terms in the document?  If that's true, then how would adopting a  
different curve as default affect the relative weight of a "title"  
field?
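
If the norm is indeed computed per field, one option is to change the curve for a single field only. The Python sketch below illustrates that idea under assumptions of my own: the field name "body", the floor of 5 terms, and the floor technique itself are all hypothetical. It is not the formula proposed in this thread, not necessarily the paper's clip method, and not Lucene code.

```python
import math

BODY_MIN_TERMS = 5  # hypothetical floor, not a value taken from this thread

def length_norm(field_name, num_terms):
    """Per-field norm: only the (hypothetical) "body" field gets a length floor.

    Assumes num_terms >= 1.
    """
    if field_name == "body":
        # Treat very short bodies as if they had BODY_MIN_TERMS terms, so a
        # 1-word body no longer gets a larger norm than a 5-word body.
        num_terms = max(num_terms, BODY_MIN_TERMS)
    return 1.0 / math.sqrt(num_terms)
```

With this shape, a short "title" field keeps the default 1/sqrt(num_terms) weighting, and only very short bodies lose their advantage.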

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

