Derrick Burns created SPARK-6068:
------------------------------------

             Summary: KMeans Parallel test may fail
                 Key: SPARK-6068
                 URL: https://issues.apache.org/jira/browse/SPARK-6068
             Project: Spark
          Issue Type: Bug
          Components: MLlib
    Affects Versions: 1.2.1
            Reporter: Derrick Burns

The test "k-means|| initialization" in KMeansSuite can fail when the random number generator is truly random. The test is predicated on the assumption that each round of k-means|| will add at least one new cluster center. The current implementation of k-means|| adds 2*k cluster centers with high probability. However, there is no deterministic lower bound on the number of cluster centers added.

Choices are:

1) Change the k-means|| implementation to keep selecting points until it has satisfied a lower bound on the number of points chosen.
2) Eliminate the test.
3) Ignore the problem and depend on the random number generator to sample the space in a lucky manner.

Option (1) is most in keeping with the contract that k-means|| should provide a precise number of cluster centers when possible.
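
To illustrate why there is no deterministic lower bound, and what option (1) could look like, here is a minimal sketch in plain Python (used for brevity; it is not Spark's Scala code). It assumes a round keeps each point independently with probability min(1, 2*k*cost/totalCost), where cost is the squared distance to the nearest existing center. Under that assumption the chance that a round selects nothing is the product of (1 - p_i) over all points, which is positive whenever every p_i is below 1. The function names, the `min_new` and `max_tries` parameters, and the farthest-point fallback are all hypothetical choices made for this example.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def cost_to_nearest(point, centers):
    """Squared distance from a point to its closest existing center."""
    return min(sq_dist(point, c) for c in centers)


def sample_round(points, centers, k, rng):
    """One sampling round (assumed rule): keep each point independently with
    probability min(1, 2*k*cost/total). The result may be empty."""
    costs = [cost_to_nearest(p, centers) for p in points]
    total = sum(costs)
    if total == 0.0:
        return []  # every point already coincides with a center
    return [p for p, c in zip(points, costs)
            if rng.random() < 2.0 * k * c / total]


def sample_with_lower_bound(points, centers, k, rng, min_new=1, max_tries=10):
    """Option (1) sketch: resample until at least min_new points are chosen.
    After max_tries failed rounds, fall back to the farthest points
    (a hypothetical tie-breaker, not something the report specifies)."""
    for _ in range(max_tries):
        chosen = sample_round(points, centers, k, rng)
        if len(chosen) >= min_new:
            return chosen
    ranked = sorted(points, key=lambda p: cost_to_nearest(p, centers),
                    reverse=True)
    return [p for p in ranked if cost_to_nearest(p, centers) > 0.0][:min_new]


if __name__ == "__main__":
    pts = [(float(i), 0.0) for i in range(1, 101)]
    new_centers = sample_with_lower_bound(pts, [(0.0, 0.0)], 1,
                                          random.Random())
    print(len(new_centers), "new center(s) selected")
```

With an unseeded generator, `sample_round` alone can return an empty list, which is the situation that makes the test fail; the wrapper returns at least `min_new` points whenever some point has nonzero cost.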