Derrick Burns created SPARK-3504:
------------------------------------

             Summary: KMeans clusterer is slow, can be sped up by 75%
                 Key: SPARK-3504
                 URL: https://issues.apache.org/jira/browse/SPARK-3504
             Project: Spark
          Issue Type: Improvement
          Components: MLlib
    Affects Versions: 1.0.2
            Reporter: Derrick Burns


The 1.0.2 implementation of the KMeans clusterer is VERY inefficient because 
recomputes all distances to all cluster centers on each iteration.  In later 
iterations of Lloyd's algorithm, points don't change clusters and clusters 
don't move.  

By 1) tracking which clusters move and 2) tracking for each point which cluster 
it belongs to and the distance to that cluster, one can avoid recomputing 
distances in many cases with very little increase in memory requirements. 

I implemented this new algorithm and the results were fantastic. Using 16 
c3.8xlarge machines on EC2,  the clusterer converged in 13 iterations on 
1,714,654 (182 dimensional) points and 20,000 clusters in 24 minutes. Here are 
the running times for the first 7 rounds:

6 minutes and 42 second
7 minutes and 7 seconds
7 minutes 13 seconds
1 minutes 18 seconds
30 seconds
18 seconds
12 seconds

Without this improvement, all rounds would have taken roughly 7 minutes, 
resulting in Lloyd's iterations taking  7 * 13 = 91 minutes. In other words, 
this improvement resulting in a reduction of roughly 75% in running time with 
no loss of accuracy.

My implementation is a rewrite of the existing 1.0.2 implementation.  It is not 
a simple modification of the existing implementation.  Please let me know if 
you are interested in this new implementation.  




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to