The idea behind Dirichlet process clustering is that it generalizes the
ideas behind k-means and makes cluster assignment probabilistic rather
than deterministic. This has several good effects. One is that we can
give k a fairly large value and the algorithm will give us a distribution
of reasonable values for the number of clusters. Another is that we can
inject distributions other than the symmetric Gaussian that k-means
implicitly uses. A third benefit is that the distributions used, even if
Gaussian, can have additional parameters, which allows much more
sophisticated clustering. For instance, long and skinny Gaussian clusters
are possible.
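To make the "long and skinny" point concrete, here is a minimal sketch of
a Gaussian cluster model with a separate variance per dimension (names
made up for illustration, not Mahout code); k-means implicitly pins all
of these variances to a single shared value:

  // Hypothetical diagonal-covariance Gaussian cluster model. Giving each
  // dimension its own variance allows elongated ("long and skinny")
  // clusters, which k-means cannot represent.
  public class DiagonalGaussian {
    private final double[] mean;
    private final double[] variance;   // one variance per dimension

    public DiagonalGaussian(double[] mean, double[] variance) {
      this.mean = mean;
      this.variance = variance;
    }

    // Log density of a point under this cluster's model.
    public double logPdf(double[] x) {
      double logP = 0;
      for (int i = 0; i < x.length; i++) {
        double d = x[i] - mean[i];
        logP -= 0.5 * (Math.log(2 * Math.PI * variance[i])
                       + d * d / variance[i]);
      }
      return logP;
    }
  }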
The way that the algorithm works is that you specify three parameters
(all three appear in the sketch after the pseudocode below):

  k     - the maximum number of clusters possible
  model - the probability distribution used for each cluster
  prior - the distribution of models we would expect with no data; this
          is what makes the algorithm converge even if you allow lots of
          variability
The algorithm is similar to k-means:
  for each iteration:
      for each point:
          compute the probabilities that the point belongs to any known
          cluster or to a new cluster (only allow a new cluster if we
          know about fewer than k clusters)
          assign the point at random according to these probabilities
      for each cluster:
          estimate the new model parameters from the points that were
          assigned to the cluster
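Here is a minimal in-memory sketch of one such iteration, with the three
parameters above passed in explicitly. The Model and Prior interfaces are
assumptions standing in for whatever cluster distribution is actually
used; none of this is Mahout code:

  import java.util.ArrayList;
  import java.util.List;
  import java.util.Random;

  // Assumed interfaces for illustration only.
  interface Model {
    double pdf(double[] x);     // likelihood of a point under this cluster
    void observe(double[] x);   // accumulate a point for re-estimation
    Model reestimate();         // new model from the accumulated points
  }

  interface Prior {
    Model empty();              // a fresh model drawn with no data
  }

  class DpIteration {
    // One pass: sample an assignment for every point, then re-estimate
    // each cluster from the points that were assigned to it.
    static List<Model> iterate(double[][] points, List<Model> clusters,
                               Prior prior, int k, Random rand) {
      for (double[] x : points) {
        int n = clusters.size();
        boolean allowNew = n < k;   // only allow a new cluster below the cap
        double[] p = new double[n + (allowNew ? 1 : 0)];
        double total = 0;
        for (int i = 0; i < n; i++) {
          p[i] = clusters.get(i).pdf(x);
          total += p[i];
        }
        Model fresh = null;
        if (allowNew) {
          fresh = prior.empty();
          p[n] = fresh.pdf(x);      // probability of founding a new cluster
          total += p[n];
        }
        // Sample an assignment at random according to these probabilities.
        // (A full DP implementation would also weight by cluster sizes and
        // the concentration parameter; omitted here for brevity.)
        double u = rand.nextDouble() * total;
        int choice = 0;
        while (choice < p.length - 1 && u > p[choice]) {
          u -= p[choice];
          choice++;
        }
        if (choice == n) {
          clusters.add(fresh);      // the point founded a new cluster
        }
        clusters.get(choice).observe(x);
      }
      // Second loop: re-estimate every cluster from its assigned points.
      List<Model> updated = new ArrayList<>();
      for (Model m : clusters) {
        updated.add(m.reestimate());
      }
      return updated;
    }
  }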
In the map-reduce implementation, the first inner loop is the mapper and
the second is the reducer.
One key problem is that the reducer has to process every point. We could
use a combiner if the model has a special form. In particular, if we can
create a new model from a single data point and the prior, and if we can
"add" models together, then we can use a combiner. The mapper would emit
a single-point model and the combiner would take many models and add
them together. For some kinds of models, notably all of those in the
exponential family, sufficient statistics exist and combining models
really is a lot like addition. Most uses of DP clustering involve
exponential-family models like the normal distribution, so the world is
a happy place.
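For a one-dimensional Gaussian, for example, the sufficient statistics
are just a count, a sum, and a sum of squares, and "adding" two models is
component-wise addition. A hypothetical sketch of such an addable model
(made-up names again, not Mahout code):

  // Hypothetical combiner-friendly model: sufficient statistics for a
  // one-dimensional Gaussian are (count, sum, sum of squares), and
  // merging two partial models is component-wise addition.
  public class GaussianStats {
    private double n;      // number of points observed
    private double sum;    // sum of the points
    private double sumSq;  // sum of the squared points

    // a model built from a single data point, as the mapper would emit
    public static GaussianStats fromPoint(double x) {
      GaussianStats s = new GaussianStats();
      s.n = 1;
      s.sum = x;
      s.sumSq = x * x;
      return s;
    }

    // the combiner/reducer step: merge two partial models
    public GaussianStats add(GaussianStats other) {
      GaussianStats s = new GaussianStats();
      s.n = this.n + other.n;
      s.sum = this.sum + other.sum;
      s.sumSq = this.sumSq + other.sumSq;
      return s;
    }

    // recover the model parameters from the combined statistics
    public double mean() {
      return sum / n;
    }

    public double variance() {
      double m = mean();
      return sumSq / n - m * m;
    }
  }

The mapper would emit fromPoint(x) keyed by the sampled cluster id; the
combiner and the reducer would both just call add, and the reducer would
recover the mean and variance at the end.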
On Wed, Nov 2, 2011 at 2:13 PM, Grant Ingersoll <[email protected]> wrote:
> Tim Potter and I have tried running Dirichlet in the past on the ASF email
> set on EC2 and it didn't seem to scale all that well, so I was wondering if
> people had ideas on improving its speed. One question I had is whether we
> could inject a Combiner into the process? Ted also mentioned that there
> might be faster ways to check the models, but I will ask him to elaborate.
>
> Thanks,
> Grant