[ https://issues.apache.org/jira/browse/MAHOUT-1162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13600557#comment-13600557 ]
Dan Filimon commented on MAHOUT-1162:
-------------------------------------
There is some debate with Ted (discussed offline) about what the best default
values for beta, clusterOvershoot and clusterLogFactor should be,
particularly for clusterOvershoot. Preliminary tests (with values 1.2,
1.5 and 2) seem to indicate that the choice does not change the results much.
More extensive tests might need to be performed. Suggestions are welcome!
> Adding BallKMeans and StreamingKMeans classes
> ---------------------------------------------
>
> Key: MAHOUT-1162
> URL: https://issues.apache.org/jira/browse/MAHOUT-1162
> Project: Mahout
> Issue Type: New Feature
> Components: Clustering
> Affects Versions: 0.8
> Reporter: Dan Filimon
> Attachments: MAHOUT_1162.patch, MAHOUT_1162_test.patch
>
>
> Adding BallKMeans and StreamingKMeans clustering algorithms.
> These both implement Iterable<Centroid> and thus return the resulting
> centroids after clustering.
> BallKMeans implements:
> - kmeans++ initialization;
> - a normal k-means pass;
> - a trimming threshold so that points that are too far from the cluster they
> were assigned to are not used in the new centroid computation.
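The trimming step in the last bullet can be illustrated with a minimal sketch. This is not the code from the attached patch: the class name TrimmedCentroidSketch, the plain double[] points and the maxDistance parameter are all assumptions made for illustration, and how the real threshold is derived is not stated here.

```java
import java.util.List;

/** Hypothetical illustration of a trimmed centroid update; not Mahout's BallKMeans code. */
final class TrimmedCentroidSketch {

    /** Euclidean distance between two vectors of equal length. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /**
     * Recomputes a centroid as the mean of its assigned points, skipping any
     * point farther than maxDistance (an assumed threshold) from the current
     * centroid. Returns a copy of the current centroid if every point is trimmed.
     */
    static double[] trimmedMean(double[] centroid, List<double[]> assigned, double maxDistance) {
        double[] sum = new double[centroid.length];
        int kept = 0;
        for (double[] point : assigned) {
            if (distance(point, centroid) <= maxDistance) {
                for (int i = 0; i < sum.length; i++) {
                    sum[i] += point[i];
                }
                kept++;
            }
        }
        if (kept == 0) {
            return centroid.clone();
        }
        for (int i = 0; i < sum.length; i++) {
            sum[i] /= kept;
        }
        return sum;
    }
}
```

With a threshold of 5.0 and a centroid at the origin, a point at (100, 100) would be ignored while nearby points still contribute to the new mean.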
> StreamingKMeans implements
> [http://books.nips.cc/papers/files/nips24/NIPS2011_1271.pdf]:
> - an online clustering algorithm that takes each point into account one by one;
> - for each point, it computes the distance to the nearest existing cluster;
> - if the distance is greater than a set distanceCutoff, a new cluster is
> created; otherwise the point might be added to the cluster it is closest to
> (with a probability that depends on distance / distanceCutoff);
> - if there are too many clusters, the clusters will be *collapsed* (the
> same method gets called, but the number of clusters is re-adjusted);
> - finally, *about as many* clusters as requested are returned (not precise!);
> this represents a sketch of the original points.
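The per-point step of StreamingKMeans described above can be sketched roughly as follows. This is one reading of the description, not the patch itself: it assumes a new cluster is created with probability min(1, distance / distanceCutoff) and that the point is otherwise merged into its nearest centroid as a weighted mean. The class and field names are hypothetical, distanceCutoff is assumed to be positive, and the collapsing step is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical sketch of the per-point streaming step; not Mahout's StreamingKMeans code. */
final class StreamingStepSketch {

    /** A centroid position plus the number of points merged into it. */
    static final class WeightedCentroid {
        final double[] position;
        double weight;

        WeightedCentroid(double[] position, double weight) {
            this.position = position.clone();
            this.weight = weight;
        }
    }

    final List<WeightedCentroid> centroids = new ArrayList<>();
    final double distanceCutoff;  // assumed to be > 0
    final Random random;

    StreamingStepSketch(double distanceCutoff, Random random) {
        this.distanceCutoff = distanceCutoff;
        this.random = random;
    }

    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Processes one point: either starts a new cluster or merges into the nearest one. */
    void add(double[] point) {
        WeightedCentroid nearest = null;
        double best = Double.POSITIVE_INFINITY;
        for (WeightedCentroid c : centroids) {
            double d = distance(point, c.position);
            if (d < best) {
                best = d;
                nearest = c;
            }
        }
        // Assumption: a new cluster is created with probability min(1, best / distanceCutoff).
        // A distance above the cutoff therefore always creates a new cluster.
        if (nearest == null || random.nextDouble() < best / distanceCutoff) {
            centroids.add(new WeightedCentroid(point, 1.0));
            return;
        }
        // Otherwise fold the point into the nearest centroid as a weighted mean.
        double total = nearest.weight + 1.0;
        for (int i = 0; i < point.length; i++) {
            nearest.position[i] = (nearest.position[i] * nearest.weight + point[i]) / total;
        }
        nearest.weight = total;
    }
}
```

Because the decision is random, the number of centroids after a pass varies from run to run, which is consistent with the description's note that the returned cluster count is only approximate.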