At crosswire.org we are using Lucene to index Bibles, with each Bible
having its own index and each verse being a document in that index. So
each document is short. Length depends on the language of the
translation, but documents run from 2 terms to just under 100.
In our case the existing bias seems appropriate, and it does not appear
to break down for extremely short documents.
I would suggest that if the bias is changed, it be based upon the length
and distribution of the documents in the index, or else be driven by
programmer-supplied parameters.
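To illustrate the parameterized approach, something along these lines
might do. The class name and the curve's exact shape are placeholders
for illustration only; the point is that the indexing application
supplies the length it considers normal for its corpus:

    import org.apache.lucene.search.DefaultSimilarity;

    // Illustration only: the caller supplies the expected document
    // length (for us, roughly the average verse length), and the norm
    // peaks there instead of at one term.
    public class TunableLengthSimilarity extends DefaultSimilarity {
        private final double expectedLen;

        public TunableLengthSimilarity(double expectedLen) {
            this.expectedLen = expectedLen;
        }

        public float lengthNorm(String fieldName, int numTerms) {
            // 1.0 at the expected length, tapering on either side.
            double ratio = numTerms < expectedLen
                    ? numTerms / expectedLen
                    : expectedLen / numTerms;
            return (float) Math.sqrt(ratio);
        }
    }

Since lengthNorm is baked into the norms at index time, any such change
would of course require reindexing.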
Mark Bennett wrote:
Our client, Rojo, is considering overriding the default implementation of
lengthNorm to fix the bias towards extremely short RSS documents.
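For context, the default curve comes from DefaultSimilarity, whose
lengthNorm is simply 1/sqrt(numTerms), so overriding it is just a matter
of subclassing. A minimal sketch (the class name is ours):

    import org.apache.lucene.search.DefaultSimilarity;

    // Minimal sketch of the override point; only the method body would
    // change. As written it reproduces the stock 1/sqrt(length) curve,
    // which peaks at one-term documents, i.e. the bias in question.
    public class RojoSimilarity extends DefaultSimilarity {
        public float lengthNorm(String fieldName, int numTerms) {
            return (float) (1.0 / Math.sqrt(numTerms));
        }
    }

The replacement would be installed via Similarity.setDefault(), or set
on the IndexWriter and Searcher, before (re)indexing.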
The general idea put forth by Doug was that longer documents tend to have
more instances of matching words simply because they are longer, whereas
shorter documents tend to be more precise and should therefore be considered
more authoritative.
While we generally agree with this idea, it seems to break down for
extremely short documents. For example, one- and two-word documents tend
to be test messages, error messages, or simple answers with no
accompanying context.
I've seen discussions of this before from Doug, Chuck, Kevin and Sanji;
likely others have posted as well. We'd like to get your feedback on our
current idea for a new implementation, and perhaps eventually see about
getting the default Lucene formula changed.
Pictures speak louder than words. I've attached a graph of what I'm about
to talk about, and if the attachment is not visible, I've also posted it
online at:
http://ideaeng.com/customers/rojo/lucene-doclength-normalization.gif
Looking at the graph, the default Lucene implementation is represented by
the dashed dark-purple line. As you can see, it gives the highest scores
to documents with fewer than 5 words, with the maximum score going to
single-word documents. Doug's quick fix of clipping the score for
documents with fewer than 100 terms is shown in light purple.
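If I've understood the clipping fix correctly, it amounts to treating
every document shorter than 100 terms as if it were exactly 100 terms
long (my paraphrase, not Doug's exact code):

    // My reading of the "clip below 100 terms" fix: flatten the left end
    // of the default curve by never letting the length drop below 100.
    public float lengthNorm(String fieldName, int numTerms) {
        return (float) (1.0 / Math.sqrt(Math.max(numTerms, 100)));
    }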
Rojo's idea was to target documents of a particular length (we've chosen
50 for this graph), and then have a smooth curve that slopes away from
there for larger and smaller documents. The red, green and blue curves
are some experiments I did trying to stretch out the standard "bell
curve" (see http://en.wikipedia.org/wiki/Normal_distribution).
The "flat" and "stretch" factors are specific to my formula. I've played
around with how gradually the curve slopes away for smaller and larger
documents; for example, the red curve really "punishes" documents with
fewer than 5 words.
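To make the shape concrete, here is roughly what I'm computing, inside a
DefaultSimilarity subclass like the one above. The constants are
placeholder values, not a final formula:

    // Sketch of the stretched bell curve: 1.0 for documents near the
    // target length, tapering off for shorter and longer ones. FLAT
    // sets the width of the plateau around the target; STRETCH controls
    // how quickly the tails fall away. All three values are placeholders.
    private static final double TARGET_LEN = 50.0;
    private static final double FLAT = 0.5;
    private static final double STRETCH = 1.0;

    public float lengthNorm(String fieldName, int numTerms) {
        // Distance from the target, measured in log space so that a
        // 25-word and a 100-word document sit equally far from 50.
        double x = Math.abs(Math.log(numTerms / TARGET_LEN));
        double tail = Math.max(0.0, x - FLAT);  // zero inside the plateau
        return (float) Math.exp(-(tail * tail) / (2.0 * STRETCH * STRETCH));
    }

In a sketch like this, shrinking STRETCH steepens the falloff, which is
the kind of tuning that yields a red-curve-style punishment of the
shortest documents.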
We'd really appreciate your feedback on this, as we do plan to do
"something". After figuring out what the curve "should be", the next
items on our end are implementation and fixing our existing indices,
which I'll save for a later post.
Thanks in advance for your feedback,
Mark Bennett
[EMAIL PROTECTED]
(on behalf of rojo.com)