On Mon, 31 Aug 2009 14:02:08 +0200 Juan Francisco Contreras Gaitan <[email protected]> wrote:
> Thank you very much for your answer, but I think I can't understand
> it very well. Could you give me some more details?

Taking up that question; Ted, please correct me anywhere I'm wrong.

> For example, what does 'DP' stand for?

DP stands for Dirichlet Process, sometimes also referred to as the
"Chinese restaurant process". There is a nice Wikipedia page on
Dirichlet processes themselves:

http://en.wikipedia.org/wiki/Dirichlet_process

How they were employed to implement a clustering algorithm in Mahout
is explained on one of our wiki pages (including references to the
original papers):

http://cwiki.apache.org/MAHOUT/dirichlet-process-clustering.html

> You can see an example of what I would like to do in my previous
> answer.

In a k-Means-like setup, you would implement your own distance
(Levenshtein in your case) and use it to assign items to clusters
during the E(stimation) step. After that you would employ your own
implementation of a centroid selection algorithm to recompute the
cluster centroids during the M(aximisation) step.

In a DP-like setup it would look a little different: during the E step,
instead of having k cluster centers, computing distances to them and
doing hard assignments, you would have k cluster models and compute the
probability of each string being generated by each model. During the M
step you would then recompute each cluster model based on how likely
each string was found to be generated by that model. To arrive at a
final assignment, once the assignment probabilities become stable you
could assign each point to the model with the highest probability.

Isabel
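To make the k-Means-like setup concrete, here is a minimal Python sketch (not Mahout's actual Java implementation; the function names, the toy data, and the choice of a medoid as the "centroid selection algorithm" are my own illustration). The E step hard-assigns each string to the nearest center under Levenshtein distance; the M step picks, as each cluster's new center, the member string with the smallest total distance to the rest of its cluster:

```python
import random

def levenshtein(a, b):
    # Classic dynamic-programming edit distance between two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def kmeans_strings(strings, k, iters=10, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(strings, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # E step: hard-assign each string to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for s in strings:
            idx = min(range(k), key=lambda c: levenshtein(s, centroids[c]))
            clusters[idx].append(s)
        # M step: centroid selection -- take the medoid, i.e. the member
        # with minimal total distance to its cluster.
        for c, members in enumerate(clusters):
            if members:
                centroids[c] = min(
                    members,
                    key=lambda s: sum(levenshtein(s, t) for t in members))
    return centroids, clusters
```

In Mahout terms, the custom distance would go into your own DistanceMeasure-style implementation and the medoid choice replaces the usual mean-based centroid update, which has no obvious meaning for strings.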
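The DP-like setup can be sketched the same way (again my own toy illustration, not the Mahout code). Here each "cluster model" is just a prototype string, and I assume a likelihood that decays exponentially with edit distance, P(s | c) proportional to exp(-d(s, proto_c)/T); the E step computes soft responsibilities, the M step refits each model from those soft counts, and the final hard assignment picks the most probable model per string:

```python
import math

def levenshtein(a, b):
    # Classic dynamic-programming edit distance between two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def soft_cluster(strings, k, iters=20, temp=1.0):
    # Naive deterministic init: assumes the first k strings are distinct.
    protos = strings[:k]
    weights = [1.0 / k] * k  # mixture weights (cluster priors)
    resp = []
    for _ in range(iters):
        # E step: probability that each model generated each string.
        resp = []
        for s in strings:
            like = [weights[c] * math.exp(-levenshtein(s, protos[c]) / temp)
                    for c in range(k)]
            z = sum(like)
            resp.append([l / z for l in like])
        # M step: refit each model from the soft counts -- the new
        # prototype minimizes the responsibility-weighted total distance,
        # and the mixture weights are the mean responsibilities.
        for c in range(k):
            protos[c] = min(
                strings,
                key=lambda s: sum(resp[i][c] * levenshtein(s, t)
                                  for i, t in enumerate(strings)))
            weights[c] = sum(r[c] for r in resp) / len(strings)
    # Final hard assignment: the model with the highest probability.
    return [max(range(k), key=lambda c: r[c]) for r in resp]
```

The real DP machinery additionally puts a prior on the number of models and resamples assignments, which this fixed-k sketch deliberately leaves out; it only illustrates the soft E/M cycle described above.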
