Hi,

I've used out of the box current KMeansPlusPlusClusterer implementation
provided by CM, however saw that it doesn't scales well on large data
volumes. One of the proposals to improve current implementation was
submitted in JIRA-1330 is to provide support for sparse big data, i.e. make
clustering algorithm to work w/ SparseVectors.

While working on 1330, I've bumped into: Elkan, Charles. "Using the
triangle inequality to accelerate k-means." ICML. Vol. 3. 2003. paper which
described additional possibility of scaling up performance of kmeans
algorithm. Method based on usage of triangle inequality to avoid
unnecessary distance computations cause by small movement of the cluster
center which doesn't affect assignment of given point to the cluster.

Simply tests using PerfTestUtils shows that using Elkan's algorithm over
the standard provided currently in CM could achieve performance boost in
order of magnitude for significantly large inputs.

I've opened MATH-1371 "Provide accelerated kmeans++ implementation"
https://issues.apache.org/jira/browse/MATH-1371 and attached my
implementation there. Will be glad to receive review and comments about it.

I believe that switching to this algorithm instead of regular one could
improve quality of CM provided solution for kmeans.

Best regards,
                      Artem Barger.

Reply via email to