Clustering in the k-means style can be viewed as fitting a mixture model to the data. A mixture model is a model of the generation of random data where you first pick one of n sub-models and then generate a point from that sub-model. The model has parameters that express the probability of picking each sub-model (the mixing weights) plus the parameters of each sub-model itself.
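To make that generative story concrete, here is a toy sketch of sampling from a mixture (a one-dimensional Gaussian mixture in Java; the weights, means, and class name are all made up for illustration):

import java.util.Random;

public class MixtureSample {
  public static void main(String[] args) {
    Random rand = new Random(42);
    double[] weights = {0.5, 0.3, 0.2};   // probability of picking each sub-model
    double[] means   = {-3.0, 0.0, 4.0};  // parameters of each sub-model
    double sigma = 1.0;                   // equal, fixed variance (the k-means assumption)

    for (int n = 0; n < 5; n++) {
      // first pick one of the sub-models according to the mixing weights ...
      double u = rand.nextDouble();
      int k = 0;
      double cum = weights[0];
      while (u > cum && k < weights.length - 1) {
        k++;
        cum += weights[k];
      }
      // ... then generate a point from the chosen sub-model
      double x = means[k] + sigma * rand.nextGaussian();
      System.out.printf("sub-model %d -> x = %.3f%n", k, x);
    }
  }
}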
With k-means, you don't explicitly compute the mixing parameters, and you assume that the sub-models are symmetric Gaussians with fixed variance. One of the simplest algorithms for fitting a mixture model is the so-called E-M algorithm. For mixture distributions, this algorithm breaks into two passes: the first assigns data points to models, and the second estimates each sub-model from the data assigned to it. (There is a rough sketch of these two passes at the end of this message.) This algorithm is easy to implement and works well in some cases, particularly for k-means.

Unfortunately, the E-M algorithm as it stands is doing what is called a maximum likelihood fit. If you take k-means and try to use a more elaborate model than identical Gaussians, maximum likelihood instantly causes horrible problems: if you allow the variance to vary, the algorithm will position some clusters on a single point and drive the variance to zero. The result is some point-like clusters that each describe a single observation, and a few diffuse clusters to handle the rest of the data. Generalization goes out the window because the point-like clusters won't describe new data at all.

So... back to your question. k-means, LDA and Dirichlet clustering are all implementations of E-M algorithms for fitting mixture distributions (or first cousins of E-M that avoid the max-likelihood traps). This allows very nice mathematical frameworks to be brought to bear on the problem of how to make these algorithms work well.

For the second question, the distinction is based on the fact that k-means is pretty inflexible with regard to the sub-models because of the maximum likelihood problems. Dirichlet clustering tackles this problem head-on and thus allows much more general models.

This transition from purely distance-based metaphors to probabilistic mixture models as the fundamental metaphor for clustering is a difficult one to make. You can work to interpret the mixture models in terms of the distance-based metaphor, but I find that counter-productive because the analogies lead to some very dangerous algorithms without obvious repair. Viewing the problem as a statistical learning problem not only makes the extension of the algorithms to new domains easier, but also provides nice fixes to the problems encountered.

If you want to stick with the distance (similarity) based metaphor, you can view the probability p(X | m_i) of an observed data point X given the i-th model m_i as the similarity of the data X to model m_i. Then the estimation of a new version of the model from a number of data points can be considered the analog of the centroid computation in k-means.

On Sun, Jan 3, 2010 at 8:10 AM, Grant Ingersoll <[email protected]> wrote:

>
> On Dec 31, 2009, at 3:39 PM, Ted Dunning wrote:
>
> > - can the clustering algorithm be viewed in a probabilistic framework
> > (k-means, LDA, Dirichlet = yes, agglomerative clustering using nearest
> > neighbors = not so much)
> >
> > - is the definition of a cluster abstract enough to be flexible with
> > regard to whether a cluster is a model or does it require stronger
> > limits. (k-means = symmetric Gaussian with equal variance, Dirichlet =
> > almost any probabilistic model)
>
> Can you elaborate a bit more on these two? I can see a bit on the
> probability side, as those approaches play a factor in how similarity is
> determined, but I don't get the significance of "cluster as a model". Is
> it just a simplification that then makes it easier to ask: does this
> document fit into the model?
>
> -Grant

--
Ted Dunning, CTO
DeepDyve
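P.S. Here is the rough sketch I mentioned above of the two E-M passes for the k-means special case (hard assignments, symmetric Gaussians with equal, fixed variance). It is just an illustrative toy in Java, not the Mahout implementation, and all of the names are invented:

import java.util.Arrays;
import java.util.Random;

public class KMeansEM {
  public static double[] run(double[] data, int k, int iterations) {
    Random rand = new Random(1);
    double[] centroids = new double[k];
    for (int i = 0; i < k; i++) {
      centroids[i] = data[rand.nextInt(data.length)];
    }

    int[] assignment = new int[data.length];
    for (int iter = 0; iter < iterations; iter++) {
      // Pass 1 (E-step analog): assign each point to the nearest model.
      // With equal, fixed variances, maximizing p(x | m_i) is the same as
      // minimizing the distance to the centroid, so "similarity to the
      // model" and "distance to the centroid" coincide here.
      for (int n = 0; n < data.length; n++) {
        int best = 0;
        for (int i = 1; i < k; i++) {
          if (Math.abs(data[n] - centroids[i]) < Math.abs(data[n] - centroids[best])) {
            best = i;
          }
        }
        assignment[n] = best;
      }

      // Pass 2 (M-step analog): re-estimate each sub-model from the points
      // assigned to it.  For this restricted model the estimate is just the
      // centroid of the assigned points.
      double[] sum = new double[k];
      int[] count = new int[k];
      for (int n = 0; n < data.length; n++) {
        sum[assignment[n]] += data[n];
        count[assignment[n]]++;
      }
      for (int i = 0; i < k; i++) {
        if (count[i] > 0) {
          centroids[i] = sum[i] / count[i];
        }
        // If we also re-estimated a per-cluster variance here by maximum
        // likelihood, a cluster that captured a single point would drive its
        // variance to zero -- exactly the degenerate fit described above.
      }
    }
    return centroids;
  }

  public static void main(String[] args) {
    double[] data = {-3.1, -2.9, -3.3, 0.1, -0.2, 0.3, 4.2, 3.8, 4.1};
    System.out.println(Arrays.toString(run(data, 3, 10)));
  }
}

The comment in the second pass marks where the maximum likelihood collapse would bite if you let each cluster estimate its own variance, which is the problem Dirichlet clustering is built to sidestep.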
