[
https://issues.apache.org/jira/browse/MAHOUT-30?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jeff Eastman updated MAHOUT-30:
-------------------------------
Attachment: dirichlet.tar
The attachment is a tar archive of a standalone Eclipse project named Dirichlet
Cluster. Its directory structure is self-contained and depends only on the
Mahout project. It uses the Gson beta1.3 jar file for
serialization/deserialization, and that jar is included.
This version contains an initial Hadoop implementation that has been tested
through 10 iterations. In each iteration, clusters are read from the previous
iteration (in the state-i directories) and points are assigned to clusters in
the Mapper. The Reducer then observes all points for each cluster and computes
new clusters, which are output for the subsequent iteration. The unit test
TestMapReduce.testDriverMRIterations() creates 400 data points and runs the MR
Driver. Then it gathers all of the state files and summarizes them on the
console.
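The shape of one iteration can be sketched in plain Java, without Hadoop. This is only an illustration and not the code in the tar: the class and method names (IterationSketch, assign, observe, iterate) are hypothetical, and the model is deliberately simplified. Assignment here picks the nearest cluster center and the new cluster is the mean of its observed points, which is effectively a k-means step; the Dirichlet implementation instead assigns points by sampling from model probabilities.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of one iteration: state-i clusters in, state-(i+1) clusters out.
final class IterationSketch {

    /** "Mapper" step: index of the cluster whose center is nearest to the point (simplified assignment). */
    static int assign(double[] point, List<double[]> centers) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int k = 0; k < centers.size(); k++) {
            double dist = 0.0;
            for (int i = 0; i < point.length; i++) {
                double diff = point[i] - centers.get(k)[i];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = k;
            }
        }
        return best;
    }

    /** "Reducer" step: observe all points assigned to one cluster and compute its new center (the mean). */
    static double[] observe(List<double[]> points) {
        double[] mean = new double[points.get(0).length];
        for (double[] p : points) {
            for (int i = 0; i < mean.length; i++) {
                mean[i] += p[i];
            }
        }
        for (int i = 0; i < mean.length; i++) {
            mean[i] /= points.size();
        }
        return mean;
    }

    /** One full iteration: group points by assigned cluster, then recompute each cluster. */
    static List<double[]> iterate(List<double[]> points, List<double[]> centers) {
        Map<Integer, List<double[]>> groups = new TreeMap<>();
        for (double[] p : points) {
            groups.computeIfAbsent(assign(p, centers), k -> new ArrayList<>()).add(p);
        }
        List<double[]> next = new ArrayList<>();
        for (int k = 0; k < centers.size(); k++) {
            List<double[]> observed = groups.get(k);
            // A cluster that received no points keeps its previous center.
            next.add(observed == null ? centers.get(k) : observe(observed));
        }
        return next;
    }

    public static void main(String[] args) {
        List<double[]> points = Arrays.asList(
            new double[] {0, 0}, new double[] {0, 2},
            new double[] {10, 10}, new double[] {10, 12});
        List<double[]> centers = Arrays.asList(new double[] {1, 1}, new double[] {9, 9});
        for (double[] c : iterate(points, centers)) {
            System.out.println(Arrays.toString(c));
        }
    }
}
```

Feeding the output of iterate() back in as the next set of centers corresponds to reading the previous state-i directory at the start of each MR iteration.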
Since I replaced the beta distribution code earlier, I have noticed that the
clustering tends to put everything into the same cluster. I am suspicious of
the beta values being computed and need to investigate this further.
I think this design will allow an arbitrary number of Mappers and Reducers up
to the number of clusters. There is a stub Combiner class that is not currently
used. I will continue to develop unit tests, but I wanted to get this into view
because it is a real first-light MR implementation.
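The limit on useful Reducers can be illustrated with a small sketch. It assumes, hypothetically, that the reduce key is an integer cluster ID and that keys are spread with the same formula as Hadoop's default HashPartitioner; the actual key type in the tar may differ. PartitionSketch and busyReducers are names made up for this example.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: how many reducers receive work when keys are cluster IDs.
final class PartitionSketch {

    /** Same formula as Hadoop's default HashPartitioner. */
    static int partition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    /** Reducers that receive at least one of the cluster keys 0..numClusters-1. */
    static Set<Integer> busyReducers(int numClusters, int numReduceTasks) {
        Set<Integer> busy = new TreeSet<>();
        for (int k = 0; k < numClusters; k++) {
            busy.add(partition(Integer.valueOf(k), numReduceTasks));
        }
        return busy;
    }

    public static void main(String[] args) {
        // Fewer reducers than clusters: every reducer gets some clusters.
        System.out.println(busyReducers(10, 4).size());
        // More reducers than clusters: the extra reducers receive nothing.
        System.out.println(busyReducers(10, 25).size());
    }
}
```

Because all points for a cluster share one key, the number of Reducers that do any work can never exceed the number of clusters, whatever the number of Mappers.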
Jeff
> dirichlet process implementation
> --------------------------------
>
> Key: MAHOUT-30
> URL: https://issues.apache.org/jira/browse/MAHOUT-30
> Project: Mahout
> Issue Type: New Feature
> Components: Clustering
> Reporter: Isabel Drost
> Assignee: Jeff Eastman
> Attachments: dirichlet.tar, MAHOUT-30.patch, MAHOUT-30b.patch,
> MAHOUT-30c.patch, MAHOUT-30d.patch, MAHOUT-30e.patch
>
>
> Copied over from original issue:
> > Further extension can also be made by assuming an infinite mixture model.
> > The implementation is only slightly more difficult and the result is a
> > (nearly) non-parametric clustering algorithm.