yu zhang created MATH-1465:
------------------------------

             Summary: Increase the initial sampling speed of 
KMeansPlusPlusClusterer for large k
                 Key: MATH-1465
                 URL: https://issues.apache.org/jira/browse/MATH-1465
             Project: Commons Math
          Issue Type: Improvement
    Affects Versions: 3.6.1
            Reporter: yu zhang


As the major difference of KMeans++ to classic Kmeans, an initial distance 
square sampling process is executed, the function is named 
"chooseInitialCenters" in the current implementation. The time complexity is 
O(dkn), where d is the dimension, k is the cluster number and n is the number 
of points.

In paper "_Ackermann, Lammersen, Märtens, Raupach, Sohler, Swierkot. 
StreamKM++: A Clustering Algorithm for Data Streams. In Proceedings of the 12th 
Workshop on Algorithm Engineering and Experiments (ALENEX 2010)_", the authors 
introduced a data structure named "coreset tree" which can reduce the 
complexity of the initial sampling to O(dn log k). This is useful in the 
scenario when value of k is large, say 1000, then the run speed of the 
algorithm will be much faster.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to