On Mon, 31 Aug 2009 14:02:08 +0200 Juan Francisco Contreras Gaitan <[email protected]> wrote:
> Thank you very much for your answer, but I think I can't understand
> it very well. Could you give me some more details?

Taking up that question; Ted, please correct me anywhere I'm wrong.

> For example, what does 'DP' stand for?

DP stands for Dirichlet Process, sometimes also referred to as the
"Chinese restaurant process". There is a nice Wikipedia page on
Dirichlet processes themselves:

http://en.wikipedia.org/wiki/Dirichlet_process

How they were employed to implement a clustering algorithm in Mahout
is explained on one of our wiki pages (including references to the
original papers):

http://cwiki.apache.org/MAHOUT/dirichlet-process-clustering.html

> You can see an example of what I would like to do in my previous
> answer.

In a k-Means-like setup, you would implement your own distance
(Levenshtein in your case) and use it to assign items to clusters
during the E(stimation) step. After that you would employ your own
implementation of a centroid selection algorithm to recompute the
cluster centroids during the M(aximisation) step.

In a DP-like setup it would look a little different: during the E step,
instead of having k cluster centers, computing distances to them and
doing hard assignments, you would have k cluster models and compute the
probability of each string being generated by each model. During the M
step you would then recompute each cluster model based on how likely
each string was found to be generated by that model. To arrive at a
final assignment, once the assignment probabilities become stable you
could assign each point to the model with the highest probability.

Isabel
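To make the k-Means-like setup concrete, here is a minimal Python sketch (not Mahout's actual Java implementation; the function names, the toy data, and the choice of a medoid as the "centroid selection algorithm" are my own illustration). The E step hard-assigns each string to the nearest center under Levenshtein distance; the M step picks, as each cluster's new center, the member string with the smallest total distance to the rest of its cluster:

```python
import random

def levenshtein(a, b):
    # Classic dynamic-programming edit distance between two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def kmeans_strings(strings, k, iters=10, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(strings, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # E step: hard-assign each string to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for s in strings:
            idx = min(range(k), key=lambda c: levenshtein(s, centroids[c]))
            clusters[idx].append(s)
        # M step: centroid selection -- take the medoid, i.e. the member
        # with minimal total distance to its cluster.
        for c, members in enumerate(clusters):
            if members:
                centroids[c] = min(
                    members,
                    key=lambda s: sum(levenshtein(s, t) for t in members))
    return centroids, clusters
```

In Mahout terms, the custom distance would go into your own DistanceMeasure-style implementation and the medoid choice replaces the usual mean-based centroid update, which has no obvious meaning for strings.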
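The DP-like setup can be sketched the same way (again my own toy illustration, not the Mahout code). Here each "cluster model" is just a prototype string, and I assume a likelihood that decays exponentially with edit distance, P(s | c) proportional to exp(-d(s, proto_c)/T); the E step computes soft responsibilities, the M step refits each model from those soft counts, and the final hard assignment picks the most probable model per string:

```python
import math

def levenshtein(a, b):
    # Classic dynamic-programming edit distance between two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def soft_cluster(strings, k, iters=20, temp=1.0):
    # Naive deterministic init: assumes the first k strings are distinct.
    protos = strings[:k]
    weights = [1.0 / k] * k  # mixture weights (cluster priors)
    resp = []
    for _ in range(iters):
        # E step: probability that each model generated each string.
        resp = []
        for s in strings:
            like = [weights[c] * math.exp(-levenshtein(s, protos[c]) / temp)
                    for c in range(k)]
            z = sum(like)
            resp.append([l / z for l in like])
        # M step: refit each model from the soft counts -- the new
        # prototype minimizes the responsibility-weighted total distance,
        # and the mixture weights are the mean responsibilities.
        for c in range(k):
            protos[c] = min(
                strings,
                key=lambda s: sum(resp[i][c] * levenshtein(s, t)
                                  for i, t in enumerate(strings)))
            weights[c] = sum(r[c] for r in resp) / len(strings)
    # Final hard assignment: the model with the highest probability.
    return [max(range(k), key=lambda c: r[c]) for r in resp]
```

The real DP machinery additionally puts a prior on the number of models and resamples assignments, which this fixed-k sketch deliberately leaves out; it only illustrates the soft E/M cycle described above.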
