Daniel Naber wrote:
> Hi,
>
> as some of you may have noticed, Lucene prefers shorter documents over
> longer ones, i.e. shorter documents get a higher ranking, even if the
> ratio "matched terms / total terms in document" is the same.
>
> For example, take these two artificial documents:
>
> doc1: x 2 3 4 5 6 7 8 9 10
> doc2: x x 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
>
> When searching for "x" doc1 will get a higher ranking, even though "x"
> makes up 1/10 of the terms in both documents.

I think it depends on what you want "similar" to mean. The preference for
shorter docs comes from the "parsimony" concept, if I remember my Information
Theory correctly. In other words, the less data it takes to get to a given
result (1/10 "x" in your example), the better. It sounds like you want doc1
and doc2 to be considered equally similar, at least for "x". Would you want
doc3 below to be treated the same way?
doc3: x 2 3 4 5 6 7 8 9 10
x 12 13 14 15 16 17 18 19 20
x 22 ... 30
x 32 ... 40
... 1000
In some situations, the appearance of "x" is more significant in doc1, because
the document contains hardly anything else in the first place. I think that
tends to be more common in English prose, which may be why it's the default in
Lucene.
I think your proposed formula would treat all three docs (1-3) the same. If
that's what you want, I'd say you're good to go.
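
To make the comparison concrete, here is a minimal Python sketch. It assumes
the proposed formula reduces to the plain ratio "matched terms / total terms in
document" from the original message, and it fills in doc3 as a hypothetical
1000-term document with "x" at every tenth position. term_ratio is an
illustrative helper, not a Lucene API.

```python
def term_ratio(tokens, term):
    """Return matched terms / total terms for a tokenized document."""
    return tokens.count(term) / len(tokens)


# doc1 and doc2 as given in the original message.
doc1 = ["x"] + [str(i) for i in range(2, 11)]
doc2 = ["x", "x"] + [str(i) for i in range(3, 21)]

# doc3 is elided in the message; assume 1000 terms with "x" at
# positions 1, 11, 21, ... and the position number everywhere else.
doc3 = ["x" if i % 10 == 1 else str(i) for i in range(1, 1001)]

ratios = {
    "doc1": term_ratio(doc1, "x"),
    "doc2": term_ratio(doc2, "x"),
    "doc3": term_ratio(doc3, "x"),
}

for name, ratio in ratios.items():
    print(name, ratio)
```

Under these assumptions a pure ratio cannot tell the three documents apart, so
any difference in ranking has to come from a separate length factor.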
--MDC