[ https://issues.apache.org/jira/browse/SPARK-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14336169#comment-14336169 ]
Derrick Burns commented on SPARK-3261:
--------------------------------------
One solution is to run KMeansParallel or KMeansRandom after each round of
Lloyd's algorithm to "replenish" empty clusters.
I have implemented the former in
https://github.com/derrickburns/generalized-kmeans-clustering.
Performance is reasonable.
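
To make the idea concrete, here is a minimal sketch in plain Python. It is not
the code from the repository above and not Spark's API. The function names
(lloyd_round, replenish) are hypothetical, and the re-seeding step is a
simplified stand-in for KMeansParallel: it draws one point per empty cluster
with probability proportional to its squared distance from the nearest
non-empty center.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def closest(point, centers):
    """Return (index, squared distance) of the center nearest to point."""
    return min(((i, sq_dist(point, c)) for i, c in enumerate(centers)),
               key=lambda t: t[1])


def lloyd_round(points, centers):
    """One Lloyd's round: assign each point, then recompute the means.

    Returns (new_centers, empty), where empty lists the indices of
    clusters that received no points and so kept their old center.
    """
    dim = len(points[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for p in points:
        i, _ = closest(p, centers)
        counts[i] += 1
        for d in range(dim):
            sums[i][d] += p[d]
    new_centers, empty = [], []
    for i, c in enumerate(centers):
        if counts[i] == 0:
            empty.append(i)
            new_centers.append(tuple(c))
        else:
            new_centers.append(tuple(s / counts[i] for s in sums[i]))
    return new_centers, empty


def replenish(points, centers, empty, rng):
    """Re-seed each empty cluster (assumed simplification of KMeansParallel).

    A replacement center is sampled with probability proportional to the
    squared distance from each point to its nearest live center.
    """
    centers = list(centers)
    live = [c for i, c in enumerate(centers) if i not in empty]
    for i in empty:
        weights = [closest(p, live)[1] for p in points]
        if sum(weights) == 0:
            break  # every point already coincides with a live center
        new_center = tuple(rng.choices(points, weights=weights, k=1)[0])
        centers[i] = new_center
        live.append(new_center)
    return centers


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    start = [(0.0, 0.5), (0.0, 0.5)]  # duplicate centers: one cluster is empty
    updated, empty = lloyd_round(pts, start)
    print(replenish(pts, updated, empty, random.Random(0)))
```

Running replenish after every round keeps all K clusters populated, at the
cost of one extra pass over the points per empty cluster in this sketch.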
Inspection reveals that the slow part of KMeansParallel is computing the sum
of the weights of the points in each cluster.
However, this cost can be reduced by sampling the points and summing the
contributions of only the sampled points. For large data sets, this
approximation is appropriate.
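
The sampling idea can be sketched as follows. This is an illustration in plain
Python under my own assumptions, not the code in the repository: the function
names and the fraction parameter are hypothetical, and the estimator is a
simple Bernoulli sample in which each sampled point's weight is scaled by
1/fraction so that the expected totals match the exact ones.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centers):
    """Index of the center nearest to point."""
    return min(range(len(centers)), key=lambda i: sq_dist(point, centers[i]))


def exact_cluster_weights(points, weights, centers):
    """Sum the weights of all points assigned to each center (full pass)."""
    totals = [0.0] * len(centers)
    for p, w in zip(points, weights):
        totals[nearest(p, centers)] += w
    return totals


def sampled_cluster_weights(points, weights, centers, fraction, rng):
    """Estimate the per-cluster weight sums from a Bernoulli sample.

    Each point is kept with probability `fraction`; kept weights are
    scaled by 1 / fraction, which makes the estimate unbiased.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    totals = [0.0] * len(centers)
    for p, w in zip(points, weights):
        if rng.random() < fraction:
            # Only sampled points pay for the nearest-center search.
            totals[nearest(p, centers)] += w / fraction
    return totals


if __name__ == "__main__":
    rng = random.Random(7)
    pts = [(rng.random(), rng.random()) for _ in range(5000)]
    wts = [1.0] * len(pts)
    ctrs = [(0.25, 0.5), (0.75, 0.5)]
    print(exact_cluster_weights(pts, wts, ctrs))
    print(sampled_cluster_weights(pts, wts, ctrs, 0.1, rng))
```

The nearest-center search runs only for the sampled points, so the work drops
roughly in proportion to fraction, while the relative error of each estimate
shrinks as the number of sampled points per cluster grows.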
> KMeans clusterer can return duplicate cluster centers
> -----------------------------------------------------
>
> Key: SPARK-3261
> URL: https://issues.apache.org/jira/browse/SPARK-3261
> Project: Spark
> Issue Type: Bug
> Components: MLlib
> Affects Versions: 1.0.2
> Reporter: Derrick Burns
> Assignee: Derrick Burns
> Labels: clustering
>
> Returning duplicate cluster centers is a bad design choice. I think it is
> preferable to produce no duplicate cluster centers: instead of forcing the
> number of clusters to be K, return at most K clusters.