[
https://issues.apache.org/jira/browse/SPARK-3504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Patrick Wendell updated SPARK-3504:
-----------------------------------
Summary: KMeans optimization: track distances and unmoved cluster centers
across iterations (was: KMeans clusterer is slow, can be sped up by 75%)
> KMeans optimization: track distances and unmoved cluster centers across
> iterations
> ----------------------------------------------------------------------------------
>
> Key: SPARK-3504
> URL: https://issues.apache.org/jira/browse/SPARK-3504
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 1.0.2
> Reporter: Derrick Burns
>
> The 1.0.2 implementation of the KMeans clusterer is VERY inefficient because
> recomputes all distances to all cluster centers on each iteration. In later
> iterations of Lloyd's algorithm, points don't change clusters and clusters
> don't move.
> By 1) tracking which clusters move and 2) tracking for each point which
> cluster it belongs to and the distance to that cluster, one can avoid
> recomputing distances in many cases with very little increase in memory
> requirements.
> I implemented this new algorithm and the results were fantastic. Using 16
> c3.8xlarge machines on EC2, the clusterer converged in 13 iterations on
> 1,714,654 (182 dimensional) points and 20,000 clusters in 24 minutes. Here
> are the running times for the first 7 rounds:
> 6 minutes and 42 second
> 7 minutes and 7 seconds
> 7 minutes 13 seconds
> 1 minutes 18 seconds
> 30 seconds
> 18 seconds
> 12 seconds
> Without this improvement, all rounds would have taken roughly 7 minutes,
> resulting in Lloyd's iterations taking 7 * 13 = 91 minutes. In other words,
> this improvement resulting in a reduction of roughly 75% in running time with
> no loss of accuracy.
> My implementation is a rewrite of the existing 1.0.2 implementation. It is
> not a simple modification of the existing implementation. Please let me know
> if you are interested in this new implementation.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]