Re: Document Term matrix

parnab kumar Tue, 11 Nov 2014 13:40:03 -0800

Hi,

While indexing the documents, store the term vectors for the content
field. For each document you will then have an array of terms and their
corresponding frequencies in that document. You can retrieve these term
vectors through the IndexReader. The similarity between two documents can
be computed as the similarity of their term vectors. Since tf-idf is the
best-known weighting scheme and tends to give a better sense of similarity
than raw counts, simply multiply each term's frequency by its idf to
re-weight the vectors.
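
To make the re-weighting and comparison concrete, here is a rough sketch in
plain Java. It does not call Lucene at all: each Map<String, Integer> stands
in for the term/frequency pairs you would read from one document's term
vector. The class and method names are made up for illustration, and
idf = ln(N / df) with cosine similarity is just one common choice, not
necessarily the formula Lucene's own scoring uses.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene. Each input map holds
// term -> raw frequency for one document (a stand-in for a term vector).
class TfIdfSimilarity {

    // Count, for every term, the number of documents that contain it.
    static Map<String, Integer> documentFrequencies(List<Map<String, Integer>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (Map<String, Integer> doc : docs) {
            for (String term : doc.keySet()) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    // Re-weight one document: weight = tf * idf, assuming idf = ln(N / df).
    // df must have been computed over a collection that includes this document.
    static Map<String, Double> tfIdf(Map<String, Integer> tf,
                                     Map<String, Integer> df, int numDocs) {
        Map<String, Double> weighted = new HashMap<>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            double idf = Math.log((double) numDocs / df.get(e.getKey()));
            weighted.put(e.getKey(), e.getValue() * idf);
        }
        return weighted;
    }

    // Cosine similarity of two sparse vectors; 0 if either has zero length.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * other;
            }
        }
        for (double v : b.values()) {
            normB += v * v;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Document-document similarity matrix over tf-idf weighted vectors.
    static double[][] similarityMatrix(List<Map<String, Integer>> docs) {
        Map<String, Integer> df = documentFrequencies(docs);
        List<Map<String, Double>> weighted = new ArrayList<>();
        for (Map<String, Integer> doc : docs) {
            weighted.add(tfIdf(doc, df, docs.size()));
        }
        int n = docs.size();
        double[][] sim = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i; j < n; j++) {
                sim[i][j] = cosine(weighted.get(i), weighted.get(j));
                sim[j][i] = sim[i][j]; // the matrix is symmetric
            }
        }
        return sim;
    }

    public static void main(String[] args) {
        // Toy term frequencies, invented for this example.
        List<Map<String, Integer>> docs = List.of(
                Map.of("lucene", 2, "index", 1),
                Map.of("lucene", 1, "cluster", 3),
                Map.of("matrix", 1, "cluster", 1));
        for (double[] row : similarityMatrix(docs)) {
            System.out.println(java.util.Arrays.toString(row));
        }
    }
}
```

Note that with this idf a term occurring in every document gets weight 0, so
it contributes nothing to the similarity. The resulting matrix can be handed
to whatever clustering code you end up using.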


Thanks,
Parnab

On Tue, Nov 11, 2014 at 8:36 PM, Elshaimaa Ali <[email protected]>
wrote:

> Hi All,
> I have a Lucene index built with Lucene 4.9 for 584 text documents. I need
> to extract a document-term matrix and a document-document similarity matrix
> in order to cluster the documents. My questions:
> 1- How can I extract the matrix and compute the similarity between
>    documents in Lucene?
> 2- Is there any Java-based code that can cluster the documents from a
>    Lucene index?
> Regards,
> Shaimaa
>
>
