Re: a "fair" similarity

Michael D. Curtin Mon, 14 Aug 2006 18:26:57 -0700

Daniel Naber wrote:

Hi,
as some of you may have noticed, Lucene prefers shorter documents over longer ones, i.e. shorter documents get a higher ranking, even if the ratio "matched terms / total terms in document" is the same.
For example, take these two artificial documents:

doc1: x 2 3 4 5 6 7 8 9 10
doc2: x x 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
When searching for "x" doc1 will get a higher ranking, even though "x" makes up 1/10 of the terms in both documents.

I think it depends upon what you want "similar" to mean. The shorter doc thing comes from the "parsimony" concept, if I remember my Information Theory correctly. In other words, the less data to get to a given result (1/10 "x" in your example) the better. It sounds like you want doc1 and doc2 to be considered exactly similar, at least for "x". Would you want doc3 below to be treated the same way?


doc3: x  2  3  4  5  6  7  8  9 10
      x 12 13 14 15 16 17 18 19 20
      x 22 ...                  30
      x 32 ...                  40
                            ... 1000

In some situations, the appearance of "x" is more significant in doc1, because hardly anything is there in the first place. I think that tends to be more common in English prose, which may be why it's the default in Lucene.

I think your proposed formula would treat all docs, 1-3, the same. If that's what you want, I'd say you're good to go.
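
To make that concrete, here is a rough Python sketch (not Lucene code) that scores each document purely by the ratio "matched terms / total terms in document" quoted above. The helper name term_ratio is made up for illustration, and since doc3 is elided in my example, the sketch assumes it has 1000 terms with "x" in place of every term at positions 1, 11, 21, and so on.

```python
from fractions import Fraction


def term_ratio(tokens, term):
    """Hypothetical helper: matched terms / total terms, as an exact fraction."""
    return Fraction(tokens.count(term), len(tokens))


# doc1 and doc2 exactly as given in the thread.
doc1 = "x 2 3 4 5 6 7 8 9 10".split()
doc2 = "x x 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20".split()

# doc3 is elided in the thread. ASSUMPTION: 1000 terms, with "x" replacing
# the term at positions 1, 11, 21, ..., 991.
doc3 = ["x" if i % 10 == 1 else str(i) for i in range(1, 1001)]

ratios = {
    name: term_ratio(doc, "x")
    for name, doc in [("doc1", doc1), ("doc2", doc2), ("doc3", doc3)]
}

for name, ratio in ratios.items():
    print(name, ratio)
```

Under that assumption all three ratios come out equal, so a purely ratio-based score cannot tell the 10-term doc1 from the 1000-term doc3. Whether that is "fair" is exactly the question of what you want "similar" to mean.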


--MDC

