Derrick Burns created SPARK-6068:
------------------------------------
Summary: KMeans Parallel test may fail
Key: SPARK-6068
URL: https://issues.apache.org/jira/browse/SPARK-6068
Project: Spark
Issue Type: Bug
Components: MLlib
Affects Versions: 1.2.1
Reporter: Derrick Burns
The test "k-means|| initialization" in KMeansSuite can fail when the random
number generator is truly random, that is, not seeded to a fixed value.
The test is predicated on the assumption that each round of k-means|| will add
at least one new cluster center. The current implementation of k-means|| adds
about 2*k cluster centers per round with high probability. However, there is
no deterministic lower bound on the number of cluster centers added in a round.
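
To make the missing lower bound concrete, here is a minimal sketch in Python.
It assumes the sampling rule from the k-means|| paper with an oversampling
factor of 2*k: each point is picked independently with probability
min(1, 2*k*cost/totalCost). The function names are hypothetical and this is
not the MLlib code; it only illustrates why a round can add zero centers.

```python
import math

def selection_probabilities(sq_dists, k):
    """Per-point inclusion probabilities for one round (assumed rule:
    min(1, 2*k*cost/total_cost), not taken from the MLlib source)."""
    total = sum(sq_dists)
    if total == 0.0:
        # Every point already coincides with a center; nothing can be added.
        return [0.0 for _ in sq_dists]
    return [min(1.0, 2.0 * k * d / total) for d in sq_dists]

def prob_no_new_center(sq_dists, k):
    """Probability that an entire round selects no point at all."""
    probs = selection_probabilities(sq_dists, k)
    # Points are sampled independently, so the probabilities of
    # "not selected" multiply.
    return math.prod(1.0 - p for p in probs)

# 100 points with equal cost and k = 2: the expected number of new centers
# is 4, yet an empty round still has nonzero probability.
print(sum(selection_probabilities([1.0] * 100, 2)))
print(prob_no_new_center([1.0] * 100, 2))
```

Under these assumptions an empty round happens with probability 0.96^100,
roughly 1.7%, which is small but far from impossible over many test runs.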
Choices are:
1) change the k-means|| implementation to keep selecting points in each round
until it has satisfied a lower bound on the number of points chosen.
2) eliminate the test
3) ignore the problem and depend on the random number generator to sample the
space in a lucky manner.
Option (1) is most in keeping with the contract that k-means|| should provide a
precise number of cluster centers when possible.
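
A rough sketch of what option (1) could look like, again in Python and again
with hypothetical names (sample_round, min_new, max_attempts are mine, not
MLlib's). It assumes the same per-point probability min(1, 2*k*cost/totalCost),
resamples until the lower bound is met, and falls back to the highest-cost
points if the retries run out, so the bound holds deterministically.

```python
import random

def sample_round(sq_dists, k, rng, min_new=1, max_attempts=100):
    """Return sorted indices of points chosen as new centers in one round.

    Guarantees at least min(min_new, number of points with nonzero cost)
    indices. Illustrative only; not the MLlib implementation.
    """
    total = sum(sq_dists)
    if total == 0.0:
        return []  # no point has positive cost, so nothing can be chosen
    probs = [min(1.0, 2.0 * k * d / total) for d in sq_dists]
    # The bound cannot exceed the number of points that are eligible at all.
    target = min(min_new, sum(1 for p in probs if p > 0.0))

    chosen = set()
    for _ in range(max_attempts):
        for i, p in enumerate(probs):
            if i not in chosen and rng.random() < p:
                chosen.add(i)
        if len(chosen) >= target:
            break

    if len(chosen) < target:
        # Deterministic fallback: top up with the costliest unchosen points.
        remaining = [i for i, p in enumerate(probs)
                     if p > 0.0 and i not in chosen]
        remaining.sort(key=lambda i: sq_dists[i], reverse=True)
        chosen.update(remaining[: target - len(chosen)])
    return sorted(chosen)

# Example: an unseeded generator, as in the failing test scenario.
print(len(sample_round([1.0] * 100, 2, random.Random())) >= 1)
```

Accumulating picks across retries changes the sampling distribution slightly
compared with a single pass, so whether that trade-off is acceptable would
need to be decided as part of the fix.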