I've been thinking about a similar problem.  However, it seems that the 
similarity score returned by a search is only meaningful within that 
search's results.  You can't compare the similarity scores from two 
different searches.  I think you will have to compute the similarities 
yourself using the term vectors.
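
To make that concrete, here is a rough sketch of what I have in mind. It 
assumes you have already read each document's term frequencies out of the 
index into a plain term -> count map (that step is not shown), and it uses 
cosine similarity, which is only one possible measure. The class and method 
names are made up for illustration and are not part of Lucene.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not a Lucene class): cosine similarity of two term-frequency vectors. */
final class TermVectorSimilarity {

    /** Dot product over the terms that appear in both vectors. */
    static double dot(Map<String, Integer> a, Map<String, Integer> b) {
        // Iterate over the smaller map and look terms up in the larger one.
        Map<String, Integer> small = a.size() <= b.size() ? a : b;
        Map<String, Integer> large = (small == a) ? b : a;
        double sum = 0.0;
        for (Map.Entry<String, Integer> e : small.entrySet()) {
            Integer other = large.get(e.getKey());
            if (other != null) {
                sum += (double) e.getValue() * other;
            }
        }
        return sum;
    }

    /** Euclidean length of a term-frequency vector. */
    static double norm(Map<String, Integer> v) {
        double sumSq = 0.0;
        for (int f : v.values()) {
            sumSq += (double) f * f;
        }
        return Math.sqrt(sumSq);
    }

    /** Cosine similarity; returns 0.0 if either vector is empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double na = norm(a);
        double nb = norm(b);
        if (na == 0.0 || nb == 0.0) {
            return 0.0;
        }
        return dot(a, b) / (na * nb);
    }

    public static void main(String[] args) {
        // Toy term counts standing in for two documents' term vectors.
        Map<String, Integer> d1 = new HashMap<>();
        d1.put("lucene", 2);
        d1.put("index", 1);
        Map<String, Integer> d2 = new HashMap<>();
        d2.put("lucene", 1);
        d2.put("cluster", 3);
        System.out.println(cosine(d1, d2));
    }
}
```

Because the value depends only on the two documents and not on a query, 
cosine(a, b) equals cosine(b, a), and the numbers are comparable across 
document pairs. Raw counts are used here; weighting by idf would be a 
reasonable refinement.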

-John

-----Original Message-----
From: Prasenjit Mukherjee [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 15, 2006 6:51 AM
To: java-user@lucene.apache.org
Subject: Document clustering using lucene


I want to do some document clustering on a corpus of ~ 100,000 
documents, with the average doc size being ~ 7k. I have looked into carrot2, 
but it seems to work only for relatively short documents and has some 
scaling issues for a large corpus.  Certainly for a corpus of this 
size, one cannot use a purely memory-based clustering algorithm. Hence the 
possible use of lucene.

I was thinking of using lucene to create the similarity matrix (between 
documents).  Before adding a document (i.e. D-k) to the lucene index, we 
can compute the similarity between D-k and all other existing 
documents by creating a Query out of D-k and doing a search on the 
existing index. We can take the score of each document as the similarity 
measure between that document and D-k. It is going to be a symmetric and 
sparse matrix. Now we can take this similarity matrix and feed it to any 
similarity-based clustering algorithm.

I would like to know if anyone has worked along similar lines and is 
happy to share their experiences.

thanks,
Prasen



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

