[jira] [Commented] (SPARK-3261) KMeans clusterer can return duplicate cluster centers

Derrick Burns (JIRA) Wed, 27 Aug 2014 17:08:31 -0700

    [ https://issues.apache.org/jira/browse/SPARK-3261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14113095#comment-14113095 ]


Derrick Burns commented on SPARK-3261:
--------------------------------------

This choice also adversely affects performance.  I just ran clustering on 1.3M
points, asking for 10,000 clusters.  The run produced only 1019 unique cluster
centers.  The original algorithm ran for 4.5 hours.  The variant that does not
allow duplicate cluster centers completed in 45 minutes, a 6x speedup on this
dataset.
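
To illustrate the "at most K" idea, here is a minimal sketch in plain Python.
It is not the MLlib code and not the patch that was timed above.  The function
name distinct_centers and the tol parameter are hypothetical, and treating two
centers as duplicates when every coordinate differs by at most tol is an
assumption made for the example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def distinct_centers(centers: Sequence[Sequence[float]],
                     tol: float = 0.0) -> List[Point]:
    """Drop duplicate centers, keeping the first occurrence of each.

    Hypothetical helper for illustration only. Two centers count as
    duplicates when they have the same dimension and every coordinate
    differs by at most `tol` (tol=0.0 means exact equality).
    """
    kept: List[Point] = []
    for center in centers:
        candidate = tuple(float(x) for x in center)
        is_duplicate = any(
            len(existing) == len(candidate)
            and all(abs(a - b) <= tol for a, b in zip(existing, candidate))
            for existing in kept
        )
        if not is_duplicate:
            kept.append(candidate)
    # The result has at most len(centers) entries, i.e. at most K.
    return kept


# Five requested centers, three distinct ones.
requested = [(0.0, 0.0), (1.0, 1.0), (0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
unique = distinct_centers(requested)
print(len(requested), "requested ->", len(unique), "distinct")
```

The pairwise comparison is quadratic in the number of centers, so this only
shows the contract (K requested, at most K returned), not an efficient way to
enforce it during initialization.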

> KMeans clusterer can return duplicate cluster centers
> -----------------------------------------------------
>
>                 Key: SPARK-3261
>                 URL: https://issues.apache.org/jira/browse/SPARK-3261
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.0.2
>            Reporter: Derrick Burns
>
> This is a bad design choice.  I think that it is preferable to produce no 
> duplicate cluster centers. So instead of forcing the number of clusters to be 
> K, return at most K clusters.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
