Ok, I see. Sorry for my lack of knowledge on these matters (I am going to read all the documentation you gave me closely).
But if I understood you correctly, and as far as I know, Mahout has its own k-means implementation. Could I use it for my purposes instead of a DP-like setup?

Thank you very much, Isabel.

Regards,
jfcg

> Date: Tue, 1 Sep 2009 08:23:05 +0200
> From: [email protected]
> To: [email protected]
> Subject: Re: String clustering and other newbie questions
>
> On Mon, 31 Aug 2009 14:02:08 +0200
> Juan Francisco Contreras Gaitan <[email protected]> wrote:
>
> > Thank you very much for your answer, but I think I can't understand
> > it very well. Could you give me some more details?
>
> Taking up that question. Ted, please correct me wherever I'm wrong.
>
> > For example, what does 'DP' stand for?
>
> DP stands for Dirichlet Process, sometimes also referred to as "Chinese
> restaurant process". There is a nice Wikipedia page on Dirichlet
> processes themselves: http://en.wikipedia.org/wiki/Dirichlet_process
>
> An explanation of how they were employed to implement a clustering
> algorithm in Mahout can be found on one of our wiki pages (including
> references to the original papers):
>
> http://cwiki.apache.org/MAHOUT/dirichlet-process-clustering.html
>
> > You can see an example of what I would like to
> > do in my previous answer.
>
> In a k-means-like setup, you would implement your own distance
> (Levenshtein in your case) and use that to assign items to clusters
> during the E(xpectation)-step. After that you would employ your own
> implementation of a centroid selection algorithm for recomputing
> cluster centroids during the M(aximisation)-step.
>
> In a DP-like setup it would look a little different: during the E-step,
> instead of having k cluster centers, computing distances to these
> centers and doing hard assignments, you would have k cluster models
> and compute the probability of each string being generated by each
> model. During the M-step you would then recompute each cluster model
> based on how likely each string was found to be generated by that model.
> To arrive at a final assignment, once the assignment probabilities
> become stable, you could choose to assign each point to the model with
> the highest probability.
>
> Isabel
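
To make the k-means-like setup from the quoted reply concrete, here is a minimal sketch in plain Python. It is not Mahout code and does not use Mahout's API. All function names are hypothetical. It also assumes that "centroid selection" means picking the cluster member with the smallest total Levenshtein distance to the other members (a medoid), since strings cannot be averaged.

```python
import random


def levenshtein(a, b):
    """Edit distance computed with the standard dynamic-programming recurrence."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def assign(strings, centers):
    """Assignment step: hard-assign each string to its nearest center."""
    return [min(range(len(centers)), key=lambda c: levenshtein(s, centers[c]))
            for s in strings]


def recompute_centers(strings, labels, centers):
    """Update step (assumed medoid rule): per cluster, keep the member with
    the smallest summed distance to all members; empty clusters keep their
    previous center."""
    new_centers = []
    for c, old in enumerate(centers):
        members = [s for s, label in zip(strings, labels) if label == c]
        if not members:
            new_centers.append(old)
            continue
        new_centers.append(
            min(members, key=lambda m: sum(levenshtein(m, o) for o in members)))
    return new_centers


def cluster_strings(strings, k, max_iter=20, seed=0):
    """Alternate the two steps until the centers stop changing."""
    rng = random.Random(seed)
    centers = rng.sample(sorted(set(strings)), k)  # k distinct starting strings
    labels = assign(strings, centers)
    for _ in range(max_iter):
        new_centers = recompute_centers(strings, labels, centers)
        if new_centers == centers:
            break
        centers = new_centers
        labels = assign(strings, centers)
    return centers, labels


if __name__ == "__main__":
    words = ["apple", "apply", "ample", "banana", "bananas", "bandana"]
    centers, labels = cluster_strings(words, k=2)
    for word, label in zip(words, labels):
        print(word, "->", centers[label])
```

The result depends on the random starting centers, so a real run would usually try several seeds. The medoid update is quadratic in cluster size, which is fine for a sketch but would need more thought for large data sets.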
