Hi,

While indexing the documents, store term vectors for the content field. Each document will then have a list of its terms and their frequencies in that document. You can retrieve these term vectors through the IndexReader. The similarity between two documents can then be computed as the similarity of their term vectors. Since tf-idf is the best-known weighting and tends to give a better sense of similarity than raw frequencies, multiply each term's frequency by its idf to re-weight the vectors before comparing them.
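
A minimal sketch of the re-weighting and comparison steps, written in plain Java with no Lucene dependency. It assumes the term frequencies of each document have already been copied out of its term vector into a map (in Lucene 4.x, IndexReader.getTermVector(docId, "content") returns the stored vector), and that the document frequencies and document count were taken from IndexReader.docFreq and IndexReader.numDocs. The class name TfIdfCosine, the ln(N/df) form of idf, and the use of cosine similarity are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class; not part of Lucene.
class TfIdfCosine {

    // Re-weight raw term frequencies by idf = ln(numDocs / docFreq).
    // This is one common idf variant (an assumption here); Lucene's own
    // similarity classes use slightly different formulas.
    static Map<String, Double> tfIdf(Map<String, Integer> termFreqs,
                                     Map<String, Integer> docFreqs,
                                     int numDocs) {
        Map<String, Double> weighted = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            // Fall back to a document frequency of 1 if the term is missing.
            int df = docFreqs.getOrDefault(e.getKey(), 1);
            double idf = Math.log((double) numDocs / df);
            weighted.put(e.getKey(), e.getValue() * idf);
        }
        return weighted;
    }

    // Cosine similarity between two sparse vectors stored as maps.
    // Returns 0.0 if either vector has zero length.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Calling cosine on every pair of re-weighted vectors fills in the document-document similarity matrix; stacking the vectors themselves, one row per document, gives the document-term matrix.
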

Thanks,
Parnab

On Tue, Nov 11, 2014 at 8:36 PM, Elshaimaa Ali <elshaimaa....@hotmail.com> wrote:

> Hi All,
> I have a Lucene index built with Lucene 4.9 for 584 text documents. I need
> to extract a document-term matrix and a document-document similarity matrix
> in order to use them to cluster the documents. My questions:
> 1- How can I extract the matrix and compute the similarity between
> documents in Lucene?
> 2- Is there any Java-based code that can cluster the documents from a
> Lucene index?
> Regards
> Shaimaa