Dear all,

Yesterday I asked a question about getting the similarity matrix of a collection of documents from an index, but I got only one answer, so perhaps my question was not very clear.

I will try to reformulate:

I want to use Lucene to get efficient access to an index of a collection of documents. My ultimate goal is to cluster the documents, so for each pair of documents I need a number expressing the similarity between them.
A possible solution would be to turn each document into a query in turn, run a search with an IndexSearcher, and take from the search results the similarity between the query (which is in fact a document) and all the other documents. This is highly redundant, because the similarity between each pair of documents is computed twice, once with each of the two documents acting as the query.
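
To make the redundancy concrete, here is a minimal sketch of the kind of computation I have in mind, in which each pair is visited only once. It assumes the term frequencies of every document have already been read out of the index into a plain map; the class PairwiseSimilarity and its methods are hypothetical names for illustration, not Lucene API, and plain cosine similarity stands in for whatever score Lucene would assign.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: each document is assumed to be
// available as a term -> frequency map extracted from the index beforehand.
class PairwiseSimilarity {

    // Euclidean length of a sparse term-frequency vector.
    static double norm(Map<String, Integer> v) {
        double sum = 0.0;
        for (int f : v.values()) {
            sum += (double) f * f;
        }
        return Math.sqrt(sum);
    }

    // Cosine similarity between two sparse term-frequency vectors.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        // Walk the smaller map and look each term up in the larger one.
        Map<String, Integer> small = a.size() <= b.size() ? a : b;
        Map<String, Integer> large = (small == a) ? b : a;
        double dot = 0.0;
        for (Map.Entry<String, Integer> e : small.entrySet()) {
            Integer other = large.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * (double) other;
            }
        }
        double na = norm(a);
        double nb = norm(b);
        // An empty document has no direction; treat its similarity as 0.
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / (na * nb);
    }

    // Fills the full n x n matrix while computing each pair (i, j) only once.
    static double[][] similarityMatrix(List<Map<String, Integer>> docs) {
        int n = docs.size();
        double[][] sim = new double[n][n];
        for (int i = 0; i < n; i++) {
            sim[i][i] = 1.0; // by convention, a document is identical to itself
            for (int j = i + 1; j < n; j++) {
                double s = cosine(docs.get(i), docs.get(j));
                sim[i][j] = s; // upper triangle
                sim[j][i] = s; // mirrored, not recomputed
            }
        }
        return sim;
    }
}
```

For n documents this does n(n-1)/2 comparisons instead of the n(n-1) implied by running one search per document, but it relies on the measure being symmetric, which I am only assuming here.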


I was wondering whether there is a simpler way to do it, since the index file contains all the information needed. Can anyone help me here?

Thanks,
roxana

PS: I know about the Carrot2 project, which deals with document clustering, but I think it is not appropriate for me, for two reasons:
1) I need to keep the index on disk for later reuse
2) I need to be able to search the index efficiently
I thought Lucene could help me here. Am I wrong?

