Hi,

Can you provide a minimal example (five sentences at most)? A drop from 1 to 0.85 seems rather large to me, so unless you removed the longest sentence with the rarest words in the collection, I suspect a bug, e.g. you forgot to remove the sentence's terms from the denominator as well. It would also be a good idea to compute the similarity without IDF weighting to see whether you get a similar effect.
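
A minimal sketch of such a check, in plain Python rather than Lucene code. The five sentences and the helper names (`tf`, `norm`, `cosine`) are made up for illustration. It compares raw term-frequency vectors (no IDF) for a document and a copy with its last sentence removed, and it also shows what happens if the norm of the original document is mistakenly reused in the denominator.

```python
import math
from collections import Counter

def tf(text):
    """Raw term-frequency vector as a Counter (no IDF weighting)."""
    return Counter(text.lower().split())

def norm(vec):
    """Euclidean length of a sparse vector."""
    return math.sqrt(sum(w * w for w in vec.values()))

def dot(a, b):
    """Dot product over the terms the two vectors share."""
    return sum(w * b[t] for t, w in a.items() if t in b)

def cosine(a, b):
    """Cosine similarity with both norms computed from the actual vectors."""
    return dot(a, b) / (norm(a) * norm(b))

# Hypothetical five-sentence document.
sentences = [
    "lucene stores term vectors",
    "term vectors hold term frequencies",
    "cosine similarity compares two vectors",
    "identical documents have similarity one",
    "removing one sentence changes little",
]
doc_a = tf(" ".join(sentences))       # full document
doc_b = tf(" ".join(sentences[:-1]))  # same document, last sentence removed

correct = cosine(doc_a, doc_b)

# Deliberately buggy variant: the denominator still uses the norm of A
# in place of the norm of the shortened document B.
stale = dot(doc_a, doc_b) / (norm(doc_a) * norm(doc_a))

print("correct:", correct)
print("stale denominator:", stale)
```

If the correctly normalised value stays close to 1 while your own number does not, the denominator is the first place to look.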

Regards,
David Nemeskey

Quoting Kasun Perera <kas...@opensource.lk>:

Hi all

I’m indexing a collection of documents using Lucene, specifying TermVector
at indexing time. I then retrieve the terms and their term frequencies by
reading the index and calculate a TF-IDF score vector for each document.
Using these TF-IDF vectors, I calculate the pairwise cosine similarity
between documents using the equation given at
http://en.wikipedia.org/wiki/Cosine_similarity.
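
For reference, the computation described above can be sketched roughly as follows. This is plain Python, not the Lucene API, and it makes assumptions the message does not state: documents are already tokenised, and IDF is taken as log(1 + N/df), which is only one of several common variants. The function names and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    # Assumed IDF variant; the original message does not say which one is used.
    idf = {t: math.log(1 + n / df[t]) for t in df}
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Toy collection: two identical documents and one unrelated document.
docs = [
    "the quick brown fox".split(),
    "the quick brown fox".split(),
    "a lazy dog sleeps".split(),
]
vecs = tfidf_vectors(docs)
sim_identical = cosine(vecs[0], vecs[1])
sim_disjoint = cosine(vecs[0], vecs[2])
print(sim_identical, sim_disjoint)
```

Both norms are computed from the same weighted vectors that enter the dot product, so identical documents score 1 and documents with no shared terms score 0.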

This is my problem:

Say I have two identical documents “A” and “B” in this collection (A and B
have more than 200 sentences).

If I calculate the pairwise cosine similarity between A and B, I get a
cosine value of 1, which is perfectly OK.

But if I remove a single sentence from doc “B”, I get a cosine similarity
of around 0.85 between the two documents. The documents are almost
identical, but the cosine value does not reflect that. I understand the
problem is with the equation that I’m using.

Is there a better way or a better equation that I can use for calculating
cosine similarity between documents?

--
Regards

Kasun Perera



