[
https://issues.apache.org/jira/browse/SPARK-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14113095#comment-14113095
]
Derrick Burns commented on SPARK-3261:
--------------------------------------
This choice also hurts performance. I just ran clustering on 1.3M
points, asking for 10,000 clusters. The run produced only 1019
unique cluster centers. The original algorithm ran for 4.5 hours. The
algorithm that does not allow duplicate cluster centers completed in 45
minutes, a 6x speedup on this dataset.
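
To make the "at most K centers" idea concrete, here is a minimal sketch in
plain Python. It is not the MLlib code and not the patch that was benchmarked
above; dedupe_centers is a hypothetical helper, and it assumes that
"duplicate" means exact coordinate equality.

```python
from typing import Iterable, List, Sequence, Tuple

Center = Tuple[float, ...]


def dedupe_centers(centers: Iterable[Sequence[float]]) -> List[Center]:
    """Drop exact-duplicate centers, keeping first-seen order.

    Hypothetical helper for illustration only; not part of Spark MLlib.
    """
    seen = set()
    unique: List[Center] = []
    for c in centers:
        # Tuples are hashable, so they can be tracked in a set.
        key = tuple(float(x) for x in c)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


if __name__ == "__main__":
    k = 4  # number of clusters requested
    centers = [[0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0]]
    unique = dedupe_centers(centers)
    # The caller receives at most k centers rather than exactly k.
    print(len(unique), "unique centers out of", k, "requested")
```

A clusterer that drops duplicates during its iterations, rather than only at
the end, would also avoid computing distances to redundant centers, which is
presumably where a speedup like the one above would come from.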
> KMeans clusterer can return duplicate cluster centers
> -----------------------------------------------------
>
> Key: SPARK-3261
> URL: https://issues.apache.org/jira/browse/SPARK-3261
> Project: Spark
> Issue Type: Bug
> Components: MLlib
> Affects Versions: 1.0.2
> Reporter: Derrick Burns
>
> Returning duplicate cluster centers is a bad design choice. It is preferable
> to produce no duplicates: instead of forcing the number of clusters to be
> exactly K, return at most K clusters.