You're right. I want document clustering, specifically of the documents that
are already in the index. I don't know much about the Mahout project, but it
seems that it doesn't help much here. What I want is simply to group similar
documents together according to the distance between their term vectors.
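
To make the idea concrete, here is a minimal sketch in plain Java of grouping
by term-vector similarity. It is only an illustration under stated
assumptions: the class TermVectorClustering and its methods are hypothetical
names (not part of Lucene, Carrot2, or Mahout), the term vectors are raw term
counts built by splitting on whitespace, and the grouping is a simple greedy
pass with a similarity threshold rather than any particular published
clustering algorithm.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper (not a Lucene class): groups documents by cosine similarity. */
class TermVectorClustering {

    /** Builds a raw term-frequency vector from whitespace-separated text (assumed tokenization). */
    static Map<String, Double> termVector(String text) {
        Map<String, Double> tf = new HashMap<>();
        for (String term : text.toLowerCase().split("\\s+")) {
            if (!term.isEmpty()) {
                tf.merge(term, 1.0, Double::sum);
            }
        }
        return tf;
    }

    /** Cosine similarity between two sparse term vectors; returns 0 if either is empty. */
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            normA += e.getValue() * e.getValue();
            Double w = b.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w;
            }
        }
        for (double w : b.values()) {
            normB += w * w;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /**
     * Greedy single-pass grouping: each document joins the first cluster whose
     * first member is at least `threshold` similar to it, otherwise it starts
     * a new cluster. Returns clusters as lists of document indices.
     */
    static List<List<Integer>> cluster(List<Map<String, Double>> docs, double threshold) {
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < docs.size(); i++) {
            boolean placed = false;
            for (List<Integer> c : clusters) {
                if (cosine(docs.get(c.get(0)), docs.get(i)) >= threshold) {
                    c.add(i);
                    placed = true;
                    break;
                }
            }
            if (!placed) {
                List<Integer> fresh = new ArrayList<>();
                fresh.add(i);
                clusters.add(fresh);
            }
        }
        return clusters;
    }
}
```

In practice the vectors would come from the index instead of from raw strings,
and the threshold (an assumed tuning parameter) would need experimentation.
The result of this greedy pass also depends on document order.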

Anyway, thank you Otis and Grant for your suggestions. I appreciate them.

Regards,

Supheakmungkol


----- Original Message ----
From: Grant Ingersoll <[EMAIL PROTECTED]>
To: java-user@lucene.apache.org
Sent: Friday, May 16, 2008 7:22:39 PM
Subject: Re: Document clustering with Lucene

Do you want search result clustering or document clustering?  My  
understanding of Carrot2 is it isn't designed for the latter.  The  
difference being it is designed to work off of shorter snippets of  
text, as opposed to the whole document.  FWIW, you _might_ find some  
help over on the Mahout project (http://lucene.apache.org/mahout) in  
terms of different approaches.  We have a couple of clustering  
approaches implemented there, but they just work off a matrix and it  
is up to you to fill the matrix (presumably with some distance  
calculation.)  The Carrot2 list may also help clarify if it is the  
appropriate place.
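
As a rough illustration of what "filling the matrix" with a distance
calculation could look like, here is a small plain-Java sketch. The class
DistanceMatrix and its methods are hypothetical names and are not Mahout's
API; the sketch assumes each document has already been turned into a dense
vector of equal length and uses cosine distance (1 minus cosine similarity)
as one possible choice.

```java
/** Hypothetical sketch (not Mahout's API): pairwise cosine-distance matrix. */
class DistanceMatrix {

    /** Cosine distance between two dense vectors of equal length. */
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 1.0; // assumption: an all-zero vector is treated as maximally distant
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Fills a symmetric n-by-n matrix; the diagonal stays 0. */
    static double[][] build(double[][] vectors) {
        int n = vectors.length;
        double[][] d = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                double dist = cosineDistance(vectors[i], vectors[j]);
                d[i][j] = dist;
                d[j][i] = dist;
            }
        }
        return d;
    }
}
```

The full matrix needs memory quadratic in the number of documents, so this
only suits collections of modest size.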

-Grant

On May 15, 2008, at 4:34 PM, Otis Gospodnetic wrote:

> Have you tried using Carrot2 with Lucene?  They work quite well in  
> tandem!
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
> ----- Original Message ----
>> From: Supheakmungkol SARIN <[EMAIL PROTECTED]>
>> To: java-user@lucene.apache.org
>> Sent: Wednesday, May 14, 2008 11:23:45 PM
>> Subject: Document clustering with Lucene
>>
>> Dear all,
>>
>> I'd like to do document clustering using full text with Lucene. In other
>> words, I would like to group similar documents together. I searched the
>> mailing list and found that there are two ways to do this. The first
>> method is to represent one document as a query and search the collection.
>> The other way would be to construct the term vector of each document and
>> use the cosine distance function to compute the similarity. I found these
>> methods here:
>>
>> - http://www.mail-archive.com/[EMAIL PROTECTED]/msg04916.html
>>
>> I would like to know whether there is a better way, or whether there are
>> any built-in functions for clustering in the recent release of Lucene.
>>
>> Thank you.
>>
>> Kind regards,
>>
>> Supheakmungkol
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
>

--------------------------
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ






