Hi, I've used out of the box current KMeansPlusPlusClusterer implementation provided by CM, however saw that it doesn't scales well on large data volumes. One of the proposals to improve current implementation was submitted in JIRA-1330 is to provide support for sparse big data, i.e. make clustering algorithm to work w/ SparseVectors.
While working on 1330, I've bumped into: Elkan, Charles. "Using the triangle inequality to accelerate k-means." ICML. Vol. 3. 2003. paper which described additional possibility of scaling up performance of kmeans algorithm. Method based on usage of triangle inequality to avoid unnecessary distance computations cause by small movement of the cluster center which doesn't affect assignment of given point to the cluster. Simply tests using PerfTestUtils shows that using Elkan's algorithm over the standard provided currently in CM could achieve performance boost in order of magnitude for significantly large inputs. I've opened MATH-1371 "Provide accelerated kmeans++ implementation" https://issues.apache.org/jira/browse/MATH-1371 and attached my implementation there. Will be glad to receive review and comments about it. I believe that switching to this algorithm instead of regular one could improve quality of CM provided solution for kmeans. Best regards, Artem Barger.