Thanks!
On Fri, May 18, 2012 at 11:19 AM, Kasun Perera <kas...@opensource.lk> wrote:
> Hi all
>
> I’m indexing a collection of documents using Lucene, specifying TermVector
> at indexing time. Then I retrieve the terms and their term frequencies by
> reading the index and calculate a vector of TF-IDF scores for each document.
> Then, using the TF-IDF vectors, I calculate the pairwise cosine similarity
> between documents using the equation here:
> http://en.wikipedia.org/wiki/Cosine_similarity
>
> This is my problem:
>
> Say I have two identical documents “A” and “B” in this collection (A and B
> each have more than 200 sentences).
>
> If I calculate the pairwise cosine similarity between A and B, it gives me
> a cosine value of 1, which is perfectly OK.
>
> But if I remove a single sentence from doc “B”, it gives me a cosine
> similarity of around 0.85 between these two documents. The documents are
> almost identical, but the cosine value does not reflect that. I understand
> the problem is with the equation that I’m using.
>
> Is there a better way or a better equation that I can use for calculating
> cosine similarity between documents?
>
> --
> Regards
>
> Kasun Perera
>
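
For reference, the computation described in the quoted message can be sketched as follows. This is a minimal illustration in plain Python, not Lucene code and not the original poster's implementation. The function names `tfidf_vectors` and `cosine`, the smoothed IDF formula, and the pre-tokenized input are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build a sparse TF-IDF vector for each document.

    docs: dict mapping a document name to its list of tokens.
    Returns a dict mapping each name to {term: weight}.
    """
    n = len(docs)

    # Document frequency: number of documents that contain each term.
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    # Smoothed IDF (an assumption for this sketch): terms that occur in
    # every document still get a non-zero weight.
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}

    vectors = {}
    for name, tokens in docs.items():
        tf = Counter(tokens)  # raw term frequency
        vectors[name] = {t: count * idf[t] for t, count in tf.items()}
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight}."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Keying the vectors by term keeps both documents aligned on the same dimensions, and the IDF weights come from one shared collection, so any difference in the score comes only from the documents themselves.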