Daniel Naber wrote:
Hi,

as some of you may have noticed, Lucene prefers shorter documents over longer ones, i.e. shorter documents get a higher ranking, even if the ratio "matched terms / total terms in document" is the same.

For example, take these two artificial documents:

doc1: x 2 3 4 5 6 7 8 9 10
doc2: x x 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

When searching for "x" doc1 will get a higher ranking, even though "x" makes up 1/10 of the terms in both documents.

I think it depends upon what you want "similar" to mean. The preference for shorter docs comes from the "parsimony" concept, if I remember my Information Theory correctly. In other words, the less data it takes to get to a given result (1/10 "x" in your example), the better. It sounds like you want doc1 and doc2 to be scored exactly the same, at least for "x". Would you want doc3 below to be treated the same way?

doc3: x  2  3  4  5  6  7  8  9 10
      x 12 13 14 15 16 17 18 19 20
      x 22 ...                  30
      x 32 ...                  40
                            ... 1000

In some situations, the appearance of "x" is more significant in doc1, because there is hardly anything else there in the first place. I think that situation tends to be more common in English prose, which may be why preferring shorter documents is the default in Lucene.

I think your proposed formula would treat all docs, 1-3, the same. If that's what you want, I'd say you're good to go.
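
To make that concrete, here is a rough sketch in Python (not Lucene code) of the plain "matched terms / total terms" ratio from your message, applied to the three docs. The helper name term_ratio is made up for illustration, and the full contents of doc3 are my assumption: I'm reading the "..." as 1000 terms with "x" in place of every number ending in 1.

```python
from fractions import Fraction


def term_ratio(tokens, term):
    """Return matched terms / total terms as an exact fraction."""
    return Fraction(tokens.count(term), len(tokens))


# doc1: x 2 3 ... 10  (10 terms, one "x")
doc1 = ["x"] + [str(n) for n in range(2, 11)]

# doc2: x x 3 4 ... 20  (20 terms, two "x")
doc2 = ["x", "x"] + [str(n) for n in range(3, 21)]

# doc3 (assumed pattern): 1000 terms, "x" replaces 1, 11, 21, ..., 991
doc3 = ["x" if n % 10 == 1 else str(n) for n in range(1, 1001)]

ratios = {
    name: term_ratio(doc, "x")
    for name, doc in [("doc1", doc1), ("doc2", doc2), ("doc3", doc3)]
}

for name, ratio in ratios.items():
    print(name, ratio)
```

Under those assumptions all three docs come out at the same ratio, so a score based only on that ratio cannot tell a 10-term doc from a 1000-term one.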

--MDC
