[ https://issues.apache.org/jira/browse/SPARK-3504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14150750#comment-14150750 ]
Derrick Burns commented on SPARK-3504: -------------------------------------- No, these are two separate issues. One refers to the initialization procedure and the other refers to iterative process after initialization. > KMeans clusterer is slow, can be sped up by 75% > ----------------------------------------------- > > Key: SPARK-3504 > URL: https://issues.apache.org/jira/browse/SPARK-3504 > Project: Spark > Issue Type: Improvement > Components: MLlib > Affects Versions: 1.0.2 > Reporter: Derrick Burns > > The 1.0.2 implementation of the KMeans clusterer is VERY inefficient because > recomputes all distances to all cluster centers on each iteration. In later > iterations of Lloyd's algorithm, points don't change clusters and clusters > don't move. > By 1) tracking which clusters move and 2) tracking for each point which > cluster it belongs to and the distance to that cluster, one can avoid > recomputing distances in many cases with very little increase in memory > requirements. > I implemented this new algorithm and the results were fantastic. Using 16 > c3.8xlarge machines on EC2, the clusterer converged in 13 iterations on > 1,714,654 (182 dimensional) points and 20,000 clusters in 24 minutes. Here > are the running times for the first 7 rounds: > 6 minutes and 42 second > 7 minutes and 7 seconds > 7 minutes 13 seconds > 1 minutes 18 seconds > 30 seconds > 18 seconds > 12 seconds > Without this improvement, all rounds would have taken roughly 7 minutes, > resulting in Lloyd's iterations taking 7 * 13 = 91 minutes. In other words, > this improvement resulting in a reduction of roughly 75% in running time with > no loss of accuracy. > My implementation is a rewrite of the existing 1.0.2 implementation. It is > not a simple modification of the existing implementation. Please let me know > if you are interested in this new implementation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org