Ok, I see. Sorry for my lack of knowledge on these matters (I am going to read all 
the documentation you gave me closely).

But if I understood you correctly, and as far as I know, Mahout has its own k-means 
implementation. Could I then use it for my purposes instead of a DP-like setup?

Thank you very much, Isabel.

Regards,
jfcg

> Date: Tue, 1 Sep 2009 08:23:05 +0200
> From: [email protected]
> To: [email protected]
> Subject: Re: String clustering and other newbie questions
> 
> On Mon, 31 Aug 2009 14:02:08 +0200
> Juan Francisco Contreras Gaitan <[email protected]> wrote:
> 
> > Thank you very much for your answer, but I think I can't understand
> > it very well. Could you give me some more details?
> 
> Taking up that question, Ted, please correct me anywhere where I'm
> wrong.
> 
> 
> > For example, what does 'DP' stand for?
> 
> DP stands for Dirichlet Process, sometimes also referred to as "Chinese
> restaurant process". There is a nice Wikipedia page on Dirichlet
> processes themselves: http://en.wikipedia.org/wiki/Dirichlet_process
> 
> How they were employed to implement a clustering algorithm in Mahout
> is explained on one of our wiki pages (including references to the
> original papers):
> 
> http://cwiki.apache.org/MAHOUT/dirichlet-process-clustering.html
> 
> 
> > You can see an example of what I would like to
> > do in my previous answer.
> 
> In a k-means-like setup, you would implement your own distance
> (Levenshtein in your case) and use that to assign items to clusters
> during the E(xpectation)-step. After that you would employ your own
> implementation of a centroid selection algorithm for recomputing
> cluster centroids during the M(aximisation)-step.
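
A minimal sketch of the k-means-like setup described above, written in
plain Python rather than against Mahout's Java API. It assumes that
"centroid selection" means picking the medoid (the cluster member with
the smallest total Levenshtein distance to the other members), since
strings cannot be averaged. The names levenshtein, assign, medoid and
cluster_strings are hypothetical and only serve the illustration.

```python
import random


def levenshtein(a, b):
    """Edit distance between strings a and b (dynamic programming, two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute (free if equal)
        prev = curr
    return prev[-1]


def assign(strings, centers):
    """Hard assignment: each string goes to its nearest center."""
    clusters = [[] for _ in centers]
    for s in strings:
        nearest = min(range(len(centers)),
                      key=lambda c: levenshtein(s, centers[c]))
        clusters[nearest].append(s)
    return clusters


def medoid(members):
    """Assumed centroid rule: member with the smallest total distance to the rest."""
    return min(members, key=lambda m: sum(levenshtein(m, o) for o in members))


def cluster_strings(strings, k, iterations=10, seed=0):
    """Alternate assignment and medoid update until the centers stop changing."""
    centers = random.Random(seed).sample(strings, k)
    for _ in range(iterations):
        clusters = assign(strings, centers)
        # Keep the old center if a cluster ended up empty.
        updated = [medoid(m) if m else centers[i] for i, m in enumerate(clusters)]
        if updated == centers:
            break
        centers = updated
    return centers, assign(strings, centers)
```

Calling cluster_strings(list_of_strings, k) returns the k chosen center
strings and the list of clusters. The result depends on the random
initial centers, so different seeds may give different groupings.
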
> 
> In a DP-like setup it would look a little different: during the E step,
> instead of having k cluster centers, computing distances to these
> clusters and doing hard assignments, you would have k cluster models
> and compute a probability of the strings being generated by each
> model. During the M step you would then recompute each cluster model
> based on how likely each string was found to be generated by that model.
> To arrive at a final assignment, after the assignment probabilities
> become stable, you could choose to assign each point to the model with
> the highest probability.
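
A rough sketch of the soft-assignment loop described above, again in
plain Python and not Mahout's Dirichlet clustering code. Several things
here are assumptions made only for illustration: each "model" is just a
prototype string, the score exp(-distance / temperature) stands in for a
real generative probability, the number of models stays fixed, and the
names responsibilities and soft_cluster are hypothetical.

```python
import math


def levenshtein(a, b):
    """Edit distance between strings a and b (dynamic programming, two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def responsibilities(strings, prototypes, temperature):
    """E step: normalised score of every string under every model."""
    rows = []
    for s in strings:
        dists = [levenshtein(s, p) for p in prototypes]
        nearest = min(dists)  # shift so the largest weight is exp(0) = 1
        weights = [math.exp(-(d - nearest) / temperature) for d in dists]
        total = sum(weights)
        rows.append([w / total for w in weights])
    return rows


def soft_cluster(strings, prototypes, iterations=10, temperature=1.0):
    prototypes = list(prototypes)
    for _ in range(iterations):
        resp = responsibilities(strings, prototypes, temperature)
        # M step (assumed rule): the new prototype of model c is the string
        # with the smallest responsibility-weighted distance to all strings.
        updated = []
        for c in range(len(prototypes)):
            updated.append(min(strings, key=lambda cand: sum(
                resp[i][c] * levenshtein(cand, s)
                for i, s in enumerate(strings))))
        if updated == prototypes:
            break
        prototypes = updated
    resp = responsibilities(strings, prototypes, temperature)
    # Final hard assignment: the model with the highest score per string.
    labels = [row.index(max(row)) for row in resp]
    return prototypes, resp, labels
```

The soft scores in resp are what distinguishes this from the hard
k-means-like assignment; the labels are only derived at the very end.
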
> 
>  
> Isabel
