[ 
https://issues.apache.org/jira/browse/MAHOUT-30?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Eastman updated MAHOUT-30:
-------------------------------

    Attachment: dirichlet.tar

This file is a tar archive of a standalone Eclipse project named Dirichlet 
Cluster. The directory structure is self-contained and depends only on the 
Mahout project. It uses the Gson beta1.3 jar file for 
serialization/deserialization, and that jar is included. 

This version contains an initial Hadoop implementation that has been tested 
through 10 iterations. In each iteration, clusters are read from the previous 
iteration (in the state-i directories) and points are assigned to clusters in 
the Mapper. The reducer then observes all points for each cluster and computes 
new clusters which are output for the subsequent iteration. The unit test 
TestMapReduce.testDriverMRIterations() creates 400 data points and runs the MR 
Driver. Then it gathers all of the state files and summarizes them on the 
console.
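
To make the per-iteration data flow above concrete, here is a minimal 
in-memory sketch. It is not the code in the tar: the class and method names 
are hypothetical, it does not use the Hadoop Mapper/Reducer API, and it 
replaces the Dirichlet sampling with a simple nearest-mean assignment so that 
the result is deterministic. It only illustrates the shape of the loop: state 
in, points keyed by cluster, new state out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; the real project uses Hadoop and samples
// assignments from the model rather than picking the nearest mean.
public class IterationSketch {

    // "Map" step: key each point by the index of the cluster it is assigned to.
    static Map<Integer, List<double[]>> map(List<double[]> points, List<double[]> centers) {
        Map<Integer, List<double[]>> byCluster = new TreeMap<>();
        for (double[] p : points) {
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int c = 0; c < centers.size(); c++) {
                double dist = 0;
                for (int i = 0; i < p.length; i++) {
                    double diff = p[i] - centers.get(c)[i];
                    dist += diff * diff;
                }
                if (dist < bestDist) {
                    bestDist = dist;
                    best = c;
                }
            }
            byCluster.computeIfAbsent(best, key -> new ArrayList<>()).add(p);
        }
        return byCluster;
    }

    // "Reduce" step: observe all points of each cluster and compute the new
    // cluster state (here just the mean). Empty clusters keep their old state.
    static List<double[]> reduce(Map<Integer, List<double[]>> byCluster, List<double[]> previous) {
        List<double[]> next = new ArrayList<>();
        for (int c = 0; c < previous.size(); c++) {
            List<double[]> observed = byCluster.get(c);
            if (observed == null) {
                next.add(previous.get(c).clone());
                continue;
            }
            double[] mean = new double[previous.get(c).length];
            for (double[] p : observed) {
                for (int i = 0; i < mean.length; i++) {
                    mean[i] += p[i] / observed.size();
                }
            }
            next.add(mean);
        }
        return next;
    }

    // Driver: the output state of iteration i is the input of iteration i+1,
    // mirroring the state-i directories.
    static List<double[]> iterate(List<double[]> points, List<double[]> initial, int iterations) {
        List<double[]> state = initial;
        for (int i = 0; i < iterations; i++) {
            state = reduce(map(points, state), state);
        }
        return state;
    }

    public static void main(String[] args) {
        List<double[]> points = new ArrayList<>();
        points.add(new double[] {0, 0});
        points.add(new double[] {0, 2});
        points.add(new double[] {10, 10});
        points.add(new double[] {10, 12});
        List<double[]> initial = new ArrayList<>();
        initial.add(new double[] {1, 1});
        initial.add(new double[] {9, 9});
        for (double[] c : iterate(points, initial, 10)) {
            System.out.println(c[0] + ", " + c[1]);
        }
    }
}
```

Because the map output is keyed by cluster index, each reduce call sees only 
the points of one cluster, so in this sketch the reduce work could be split 
across at most as many workers as there are clusters.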

Since I replaced the beta distribution code earlier, I have noticed that the 
clustering tends to put everything into the same cluster. I'm suspicious 
of the beta values that are being computed and need to investigate this 
further. 

I think this design will allow an arbitrary number of Mappers, and Reducers up 
to the number of clusters. There is a stub Combiner class that is not currently 
used. I will continue to develop unit tests, but I wanted to get this into view 
because it is a real first-light MR implementation.

Jeff

> dirichlet process implementation
> --------------------------------
>
>                 Key: MAHOUT-30
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-30
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Clustering
>            Reporter: Isabel Drost
>            Assignee: Jeff Eastman
>         Attachments: dirichlet.tar, MAHOUT-30.patch, MAHOUT-30b.patch, 
> MAHOUT-30c.patch, MAHOUT-30d.patch, MAHOUT-30e.patch
>
>
> Copied over from original issue:
> > Further extension can also be made by assuming an infinite mixture model. 
> > The implementation is only slightly more difficult and the result is a 
> > (nearly)
> > non-parametric clustering algorithm.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
