[ 
https://issues.apache.org/jira/browse/SPARK-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14336169#comment-14336169
 ] 

Derrick Burns commented on SPARK-3261:
--------------------------------------

One solution is to run KMeansParallel or KMeansRandom after each round of 
Lloyd's algorithm to "replenish" empty clusters.
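
To make the idea concrete, here is a minimal single-machine sketch in plain 
Python. It is not the code from the repository mentioned in this comment, and 
the function names are hypothetical. It assumes that an empty cluster is 
re-seeded by picking a point with probability proportional to its squared 
distance from the nearest surviving center (k-means++ style sampling, used 
here as a serial stand-in for KMeansParallel).

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def lloyd_round_with_replenish(points, centers, rng):
    """Run one Lloyd round, then re-seed every cluster that ended up empty.

    Hypothetical helper for illustration only.
    """
    k = len(centers)

    # Assignment step: each point goes to its nearest center.
    # Ties (including duplicate centers) resolve to the lowest index,
    # so a duplicated center leaves the later copy empty.
    members = [[] for _ in range(k)]
    for p in points:
        nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
        members[nearest].append(p)

    # Update step: non-empty clusters move to the mean of their points.
    new_centers = [None] * k
    for i, pts in enumerate(members):
        if pts:
            dim = len(pts[0])
            new_centers[i] = tuple(
                sum(p[d] for p in pts) / len(pts) for d in range(dim)
            )

    # Replenish step (assumption: squared-distance sampling).
    live = [c for c in new_centers if c is not None]
    for i in range(k):
        if new_centers[i] is None:
            weights = [min(sq_dist(p, c) for c in live) for p in points]
            if sum(weights) == 0:
                # Every point already sits on a center: return fewer than k.
                continue
            pick = rng.choices(points, weights=weights, k=1)[0]
            new_centers[i] = tuple(pick)
            live.append(new_centers[i])

    return [c for c in new_centers if c is not None]
```

Called with duplicate starting centers, for example 
lloyd_round_with_replenish([(0, 0), (2, 0)], [(0, 0), (0, 0)], random.Random(1)), 
the round returns two distinct centers instead of a duplicate pair.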

I have implemented the former in 
https://github.com/derrickburns/generalized-kmeans-clustering.

Performance is reasonable. 

Inspection reveals that the slow part of the KMeansParallel computation is 
summing the weights of the points in each cluster.

However, this cost can be reduced by sampling the points and summing the 
contributions of only the sampled points. For large data sets, this 
approximation is appropriate.
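
The sampling idea can be sketched as follows, again in plain Python rather 
than Spark, with hypothetical names. The sketch assumes independent sampling 
of each point with probability `fraction` and rescales each sampled weight by 
1 / fraction so the per-cluster totals stay unbiased. The comment above does 
not specify the sampling scheme, so treat that choice as an assumption.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def estimate_cluster_weights(points, weights, centers, fraction, rng):
    """Estimate the total point weight per cluster from a random sample.

    Hypothetical helper for illustration only. With fraction=1.0 every
    point is visited and the totals are exact.
    """
    totals = [0.0] * len(centers)
    for p, w in zip(points, weights):
        # Keep each point independently with probability `fraction`.
        if rng.random() < fraction:
            nearest = min(
                range(len(centers)), key=lambda i: sq_dist(p, centers[i])
            )
            # Rescale so the expected total equals the true total.
            totals[nearest] += w / fraction
    return totals
```

Only the sampled points pay for the nearest-center search, so the work drops 
roughly in proportion to `fraction`, at the price of some variance in the 
estimated totals.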

> KMeans clusterer can return duplicate cluster centers
> -----------------------------------------------------
>
>                 Key: SPARK-3261
>                 URL: https://issues.apache.org/jira/browse/SPARK-3261
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.0.2
>            Reporter: Derrick Burns
>            Assignee: Derrick Burns
>              Labels: clustering
>
> This is a bad design choice.  I think that it is preferable to produce no 
> duplicate cluster centers. So instead of forcing the number of clusters to be 
> K, return at most K clusters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
