Re: [CONF] Apache Lucene Mahout: Dirichlet Process Clustering (page created)

Jeff Eastman Sat, 15 Nov 2008 09:40:38 -0800

I'd like to personally thank Ted Dunning for guiding me down this path; I could not have done it alone. These words are mostly his, but I now think I can speak them myself with some confidence. It has been a most amazing journey. Now, on to a Hadoop implementation...


Jeff


[EMAIL PROTECTED] wrote:

Dirichlet Process Clustering (MAHOUT) created by Jeff Eastman
   
http://cwiki.apache.org/confluence/display/MAHOUT/Dirichlet+Process+Clustering

Content:
---------------------------------------------------------------------

The Dirichlet Process Clustering algorithm performs Bayesian mixture modeling.

The idea is to explain some observed data with a probabilistic mixture of a 
number of models. Each observed data point is assumed to have come from one of 
the models in the mixture, but we don't know which. We deal with that by 
introducing a so-called latent parameter for each data point, which specifies 
the model that the point came from.
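
The generative picture above can be sketched in a few lines of Python. This is 
only an illustration under stated assumptions: the models are taken to be 
one-dimensional Gaussians with a shared standard deviation, and the function 
name, parameters and numbers are hypothetical, not taken from Mahout.

```python
import random


def generate_mixture_data(n_points, weights, means, sigma=1.0, seed=0):
    """Draw points from a 1-D Gaussian mixture and record the latent assignments.

    Assumption for illustration: every model is a Gaussian with its own mean
    and a shared standard deviation `sigma`.
    """
    rng = random.Random(seed)
    points, latent = [], []
    for _ in range(n_points):
        # Latent parameter: the index of the model that generates this point.
        z = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        points.append(rng.gauss(means[z], sigma))
        latent.append(z)
    return points, latent


points, latent = generate_mixture_data(
    200, weights=[0.5, 0.3, 0.2], means=[-5.0, 0.0, 5.0]
)
```

In a real clustering problem only `points` is observed; `latent` is exactly 
the information the sampler has to infer.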


In addition, since this is a Bayesian clustering algorithm, we don't want to 
commit to any single explanation. Instead, we sample from the distribution of 
models and latent assignments of data points to models, given the observed 
data and the prior distributions of model parameters. This sampling process is 
initialized by taking models at random from the prior distribution for models.

Then, we iteratively assign points to the different models. Each assignment is 
drawn using the mixture probabilities together with the degree of fit between 
the point and each model, expressed as the probability that the point was 
generated by that model. After points are assigned, new parameters for each 
model are sampled from the posterior distribution for the model parameters, 
considering all of the observed data points that were assigned to the model. 
Models without any data points are also sampled, but since they have no points 
assigned, the new samples are effectively taken from the prior distribution 
for model parameters.
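
The iteration described above can be sketched as follows. This is a minimal 
sketch, not the Mahout implementation. It assumes one-dimensional Gaussian 
models with a known standard deviation `sigma`, a Normal prior N(mu0, tau0^2) 
on each model mean, and a symmetric Dirichlet prior over a fixed maximum number 
of models. It also assumes that each point is assigned with probability 
proportional to mixture probability times likelihood. All function names and 
default values are hypothetical.

```python
import math
import random


def sample_dirichlet(rng, alphas):
    """Draw one sample from a Dirichlet distribution via normalised gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def sample_mean_posterior(rng, assigned, mu0, tau0, sigma):
    """Sample a model mean given its assigned points (conjugate Normal prior).

    With no assigned points this reduces to a draw from the prior N(mu0, tau0^2).
    """
    precision = 1.0 / tau0 ** 2 + len(assigned) / sigma ** 2
    mean = (mu0 / tau0 ** 2 + sum(assigned) / sigma ** 2) / precision
    return rng.gauss(mean, math.sqrt(1.0 / precision))


def gibbs_samples(points, num_models=10, alpha0=1.0, mu0=0.0, tau0=10.0,
                  sigma=1.0, iterations=20, seed=0):
    rng = random.Random(seed)
    # Initialise by sampling models and mixture probabilities from their priors.
    means = [rng.gauss(mu0, tau0) for _ in range(num_models)]
    weights = sample_dirichlet(rng, [alpha0 / num_models] * num_models)
    samples = []
    for _ in range(iterations):
        # Step 1: assign each point to a model with probability proportional to
        # (mixture probability) * (likelihood of the point under that model).
        assignments = []
        for x in points:
            log_p = [math.log(max(w, 1e-300)) - 0.5 * ((x - m) / sigma) ** 2
                     for w, m in zip(weights, means)]
            top = max(log_p)  # subtract the maximum for numerical stability
            probs = [math.exp(lp - top) for lp in log_p]
            assignments.append(rng.choices(range(num_models), weights=probs, k=1)[0])
        # Step 2: resample every model's parameters from its posterior.
        groups = [[] for _ in range(num_models)]
        for x, z in zip(points, assignments):
            groups[z].append(x)
        means = [sample_mean_posterior(rng, g, mu0, tau0, sigma) for g in groups]
        # Resample mixture probabilities given the assignment counts.
        weights = sample_dirichlet(rng, [alpha0 / num_models + len(g) for g in groups])
        samples.append((weights, means, assignments))
    return samples


points = [-5.1, -4.9, -5.0, 5.0, 5.2, 4.8]
samples = gibbs_samples(points)
```

Each element of `samples` holds one set of mixture probabilities, model means 
and point assignments. Empty models get a fresh mean from the prior on every 
pass, which is what lets a previously unused model pick up points later.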

The result is a number of samples that represent mixing probabilities, models 
and assignment of points to models. If the total number of possible models is 
substantially larger than the number that ever have points assigned to them, 
then this procedure amounts to a (nearly) non-parametric clustering algorithm. 
These samples can give us interesting information that is lacking from a normal 
clustering that consists of a single assignment of points to clusters.  
First, by examining the number of models in each sample that actually have any 
points assigned to them, we can get information about how many models 
(clusters) the data support. Moreover, by examining how often two points are 
assigned to the same model, we can get an approximate measure of how likely 
these points are to be explained by the same model.  Such soft membership 
information is difficult to come by with conventional clustering methods.
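
Both summaries mentioned above are simple counts over the samples. The sketch 
below assumes each sample's assignments are stored as a list of model indices, 
one per data point; the function names and the toy data are hypothetical.

```python
def occupied_counts(assignment_samples):
    """Number of models with at least one assigned point, per sample."""
    return [len(set(assignments)) for assignments in assignment_samples]


def coassignment_frequency(assignment_samples, i, j):
    """Fraction of samples in which points i and j share a model."""
    same = sum(1 for assignments in assignment_samples
               if assignments[i] == assignments[j])
    return same / len(assignment_samples)


# Three toy samples over four data points (made-up values for illustration).
assignment_samples = [
    [0, 0, 1, 1],
    [2, 2, 2, 1],
    [0, 0, 3, 3],
]

print(occupied_counts(assignment_samples))               # [2, 2, 2]
print(coassignment_frequency(assignment_samples, 0, 1))  # 1.0
print(coassignment_frequency(assignment_samples, 2, 3))  # 2 of 3 samples
```

Note that the model indices themselves differ between samples; only the 
question "same model or not" is compared, so the summaries do not depend on 
how models happen to be labelled.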

Finally, we can get an idea of how stable the description of the data is.  
Typically, aspects of the data that are backed by many observations wind up 
with stable descriptions, while at the edges there are phenomena for which we 
can't really commit to a solid description, even though it is clear that the 
well-supported explanations are insufficient to explain them.

One thing that can be difficult about these samples is that we can't always 
establish a correspondence between the models in the different samples.  
Probably the best way to do this is to look for overlap in the assignments of 
data observations to the different models.
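
One way to look for such overlap is sketched below. It pairs each model in one 
sample with the model in another sample whose set of assigned points overlaps 
it most. Using the Jaccard index as the overlap measure is an assumption made 
here for illustration, and the function names are hypothetical.

```python
from collections import defaultdict


def members(assignments):
    """Map each occupied model index to the set of point indices assigned to it."""
    groups = defaultdict(set)
    for point_index, model in enumerate(assignments):
        groups[model].add(point_index)
    return dict(groups)


def match_models(assignments_a, assignments_b):
    """For each model in sample A, find the model in sample B with most overlap.

    Overlap is measured with the Jaccard index: |intersection| / |union|.
    """
    groups_a, groups_b = members(assignments_a), members(assignments_b)
    matches = {}
    for model_a, points_a in groups_a.items():
        matches[model_a] = max(
            groups_b,
            key=lambda model_b: len(points_a & groups_b[model_b])
            / len(points_a | groups_b[model_b]),
        )
    return matches


# Two toy samples over five data points; the labels differ between samples.
sample_a = [0, 0, 0, 1, 1]
sample_b = [3, 3, 3, 3, 7]
print(match_models(sample_a, sample_b))  # {0: 3, 1: 7}
```

The matching is greedy and one-directional, so two models in the first sample 
can map to the same model in the second. That is itself a hint that the two 
samples split the data differently.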


---------------------------------------------------------------------
