Derrick Burns created SPARK-3218:
------------------------------------
Summary: K-Means clusterer can fail on degenerate data
Key: SPARK-3218
URL: https://issues.apache.org/jira/browse/SPARK-3218
Project: Spark
Issue Type: Bug
Components: MLlib
Affects Versions: 1.0.2
Reporter: Derrick Burns
The KMeans parallel (k-means||) initialization selects points to be cluster
centers with probability weighted by their distance to the cluster centers
chosen so far. However, if there are fewer than k DISTINCT points in the data
set, this approach will fail: once every distinct point has been chosen, all
remaining points lie at distance zero from a center and none can be sampled.
Further, the recent check-in that works around this problem results in the same
point being selected repeatedly as a cluster center.
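
To make the failure concrete, here is a minimal sketch in plain Python. It is
not the MLlib code: the helper name seeding_weights and the toy data are
hypothetical, and it uses squared distance as the weight (as k-means++ does),
which is an assumption; the outcome is the same with plain distance.

```python
def seeding_weights(points, centers):
    """Weight of each point: squared distance to its nearest chosen center."""
    return [
        min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
        for p in points
    ]

# Five points, but only two DISTINCT values.
points = [(0.0, 0.0), (0.0, 0.0), (1.0, 1.0), (1.0, 1.0), (1.0, 1.0)]

# Suppose both distinct points have already been chosen as centers.
centers = [(0.0, 0.0), (1.0, 1.0)]

weights = seeding_weights(points, centers)
total = sum(weights)

# Every point coincides with a center, so all weights are zero and there is
# no valid probability distribution from which to draw a third center.
print(weights, total)
```

With k = 3 on this data, any sampler that normalizes by the total weight has
nothing to normalize by, and any fallback that still insists on k centers has
to repeat a point.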
The fix is to allow fewer than k cluster centers to be selected. This requires
several changes to the code, as the number of cluster centers is woven into the
implementation.
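
A rough sketch of what "allow fewer than k centers" could look like, again in
plain Python rather than Spark. This is a simplified sequential,
k-means++-style seeding, not the k-means|| code in MLlib and not the patch
described in this issue; choose_centers and squared_distance are hypothetical
names used only for illustration.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def choose_centers(points, k, rng):
    """Pick up to k centers; return fewer if the distinct points run out."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Weight each point by squared distance to its nearest chosen center.
        weights = [min(squared_distance(p, c) for c in centers) for p in points]
        if sum(weights) == 0.0:
            # Every point already coincides with a center: stop early
            # instead of failing or repeating a center.
            break
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers

points = [(0.0, 0.0), (0.0, 0.0), (1.0, 1.0), (1.0, 1.0), (1.0, 1.0)]
centers = choose_centers(points, k=4, rng=random.Random(0))
print(len(centers), sorted(centers))
```

Callers then have to use len(centers) rather than k everywhere downstream,
which is the kind of change the paragraph above refers to.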
I have a version of the code that addresses this problem, AND generalizes the
distance metric. However, I see that there are literally hundreds of
outstanding pull requests. If someone will commit to working with me to
sponsor the pull request, I will create it.