Hi Ted,

Indeed, it was precisely all that sampling that confounded me for so long, especially in untyped R. All the other clustering algorithms can be thought of as sampling too, but their pdfs == 1 for the model chosen. I think it was actually a conversation with a statistics guy at Yahoo!, when I gave a Mahout intro last summer, that got me thinking outside of that box. He noted that, for large data sets, it is really not necessary to process all the points to get meaningful clusters, just to sample from them. That took a few months to really sink in, and then the aha moment happened. I think I posted to this list at that point. Your refactoring of my initial abstractions cemented the deal :).

If I continue down the path of using running sums to compute the new model parameters, I think I can eliminate materializing the set of points assigned to each model in recomputeModels(). I need to add an observe() method to the Model interface and do some more refactoring of ModelDistribution; it is all a little half-baked right now. I'll post it to Jira if I get it working, but the basic idea is to create a new set of prior models in the assignPointsToModels() method, ask the assigned model to observe() each point as I iterate through them, and then just compute posterior parameters in recomputeModels(). Of course, I'll have to figure out how to compute the new mixtures differently, without z, but I have some ideas.
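To make the running-sums idea concrete, here is a minimal sketch of what observe() might accumulate for a one-dimensional normal model. This is illustrative only, not the actual Mahout Model interface: the field names (s0, s1, s2) and the computeParameters() method are assumptions; the point is that the posterior mean and standard deviation fall out of three running sums, so the assigned points never need to be materialized.

```java
// Hypothetical sketch of the running-sums approach described above.
// observe() follows the email's proposed method name; everything else
// (field names, computeParameters()) is an assumption for illustration.
class NormalModel {
  // Running sums: s0 = count, s1 = sum of x, s2 = sum of x^2.
  double s0, s1, s2;
  double mean, stdDev = 1.0;

  // Accumulate sufficient statistics instead of keeping the point set.
  void observe(double x) {
    s0++;
    s1 += x;
    s2 += x * x;
  }

  // Compute posterior parameters from the running sums alone,
  // then reset the sums for the next iteration.
  void computeParameters() {
    if (s0 > 0) {
      mean = s1 / s0;
      double variance = s2 / s0 - mean * mean;
      stdDev = Math.sqrt(Math.max(variance, 1e-9));
    }
    s0 = s1 = s2 = 0;
  }
}
```

The same pattern generalizes to multivariate models: whatever sufficient statistics the model family needs (count, sum vector, sum of outer products) can be accumulated in observe() and consumed in one shot when the posterior is computed.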

I'll keep you all posted,
Jeff


Ted Dunning (JIRA) wrote:
[ https://issues.apache.org/jira/browse/MAHOUT-30?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12646967#action_12646967 ]
Ted Dunning commented on MAHOUT-30:
-----------------------------------

Jeff,

These look like really nice refactorings.  The process is nice and clear.

The only key trick that may confuse people is that each step is a sampling.
Thus assignment to clusters does NOT assign to the best cluster; it picks a
cluster at random, biased by the mixture parameters and model pdfs. Likewise,
model computation does NOT compute the best model; it samples from the
distribution given by the data. The same is true for the mixture parameters.

Your code does this. I just think that this is a hard point for people to understand in these techniques.
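A small sketch of the sampled assignment Ted describes, to contrast it with a hard (argmax) assignment. This is not Mahout code: the helper name and the unnormalized weights p[i] (mixture parameter times model pdf at the point) are assumptions for illustration.

```java
// Illustrative only: draw a cluster index at random, with probability
// proportional to p[i] = mixture[i] * pdf_i(point), rather than taking
// the argmax. This is what makes each assignment step a sampling.
import java.util.Random;

class SampledAssignment {
  // Sample an index from unnormalized non-negative weights p.
  static int sample(double[] p, Random rng) {
    double total = 0;
    for (double v : p) {
      total += v;
    }
    double u = rng.nextDouble() * total;
    for (int i = 0; i < p.length; i++) {
      u -= p[i];
      if (u <= 0) {
        return i;
      }
    }
    return p.length - 1; // guard against floating-point rounding
  }
}
```

Because the draw is biased rather than deterministic, a point near the boundary of two clusters will sometimes land in either one, which is exactly what the Gibbs-style iteration needs in order to sample from the posterior instead of converging to a single "best" answer.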
dirichlet process implementation
--------------------------------

                Key: MAHOUT-30
                URL: https://issues.apache.org/jira/browse/MAHOUT-30
            Project: Mahout
         Issue Type: New Feature
         Components: Clustering
           Reporter: Isabel Drost
        Attachments: MAHOUT-30.patch


Copied over from original issue:
Further extension can also be made by assuming an infinite mixture model. The 
implementation is only slightly more difficult and the result is a (nearly)
non-parametric clustering algorithm.

