I also have the same task as you do. According to my understanding, suppose their are N documents, your approach will take N^2 similarity calculations.
Although there are N(N-1)/2 distinct document pairs, the similarity calculation (according to my understanding) in Lucene is asymmetric, so this means you have to calculate N(N-1) similaries. Therefore, seems your approach is not so redundant since you have to calculate O(N^2) order of similarities. On Tue, 30 Nov 2004, Roxana Angheluta wrote: > Dear all, > > Yesterday I've asked a question about geting the similarity matrix of a > collection of documents from an index, but I got only one answer, so > perhaps my question was not very clear. > > I will try to reformulate: > > I want to use Lucene to have efficient access to an index of a > collection of documents. My final purpose is to cluster documents. > Therefore I need to have for each pair of documents a number signifying > the similarity between them. > A possible solution would be to initialize in turn each document as a > query, do a search using an IndexSearcher and to take from the search > result the similarity between the query (which is in fact a document) > and all the other documents. This is highly redundant, because the > similarity between a pair of documents is computed multiple times. > > I was wondering whether there is a simpler way to do it, since the index > file contains all the information needed. Can anyone help me here? > > Thanks, > roxana > > PS I know about the project Carrot2, which deals with document > clustering, but I think is not appropriate for me because of 2 reasons: > 1) I need to keep the index on the disk for further reusage > 2) I need to be able to search efficiently in the index > I thought Lucene can help me here, am I wrong? > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
