similarity matrix - more clear

Roxana Angheluta Tue, 30 Nov 2004 06:08:00 -0800

Dear all,

Yesterday I asked a question about getting the similarity matrix of a collection of documents from an index, but I got only one answer, so perhaps my question was not very clear.

I will try to reformulate:

I want to use Lucene for efficient access to an index of a collection of documents. My final goal is to cluster the documents, so for each pair of documents I need a number expressing the similarity between them. A possible solution would be to use each document in turn as a query, run a search with an IndexSearcher, and take from the search results the similarity between the query (which is in fact a document) and all the other documents. This is highly redundant, because the similarity between each pair of documents is computed more than once.
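
To make the redundancy concrete, here is a minimal sketch of a pairwise computation that visits each pair of documents only once. It assumes the term frequencies of each document have already been read out of the index into a plain dictionary; the names cosine, similarity_matrix and docs are hypothetical, nothing here is Lucene's API, and plain cosine similarity stands in for whatever score a search would return.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency dicts."""
    # Iterate over the smaller dict so the dot product touches fewer terms.
    if len(a) > len(b):
        a, b = b, a
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def similarity_matrix(doc_vectors):
    """Build a symmetric matrix, computing each unordered pair once."""
    n = len(doc_vectors)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        sim[i][i] = 1.0  # by convention, a document is identical to itself
    for i, j in combinations(range(n), 2):
        score = cosine(doc_vectors[i], doc_vectors[j])
        sim[i][j] = score
        sim[j][i] = score  # reuse the value instead of recomputing it
    return sim


# Made-up term frequencies for three tiny documents (illustrative only).
docs = [
    {"lucene": 2, "index": 1},
    {"lucene": 1, "search": 1},
    {"cluster": 3},
]
matrix = similarity_matrix(docs)
```

With n documents this evaluates n(n-1)/2 pairs, about half the scores that the one-query-per-document approach produces. It is still quadratic in the number of documents, so it says nothing about whether the index offers a cheaper route.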

I was wondering whether there is a simpler way to do it, since the index file contains all the information needed. Can anyone help me here?

Thanks,
roxana

PS: I know about the Carrot2 project, which deals with document clustering, but I think it is not appropriate for me, for two reasons: 1) I need to keep the index on disk for later reuse; 2) I need to be able to search the index efficiently. I thought Lucene could help me here. Am I wrong?

