Derrick Burns created SPARK-6068:
------------------------------------

             Summary: KMeans Parallel test may fail
                 Key: SPARK-6068
                 URL: https://issues.apache.org/jira/browse/SPARK-6068
             Project: Spark
          Issue Type: Bug
          Components: MLlib
    Affects Versions: 1.2.1
            Reporter: Derrick Burns


The test "k-means|| initialization" in KMeansSuite can fail when the random 
number generator is truly random.

The test is predicated on the assumption that each round of k-means|| will add 
at least one new cluster center. The current implementation of k-means|| adds 
2*k cluster centers with high probability. However, there is no deterministic 
lower bound on the number of cluster centers added.
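
The missing lower bound can be made concrete with a small sketch. It assumes 
the sampling rule from the k-means|| paper, where each point is selected 
independently with probability min(1, l * d^2 / total cost); it is not the 
MLlib code, and the function and parameter names are hypothetical.

```python
def prob_round_adds_nothing(sq_dists, oversample):
    """Probability that one sampling round selects no point at all.

    Assumption: each point i is chosen independently with probability
    min(1, oversample * sq_dists[i] / sum(sq_dists)).
    """
    total = sum(sq_dists)
    if total == 0:
        # Every point already coincides with a center; nothing can be added.
        return 1.0
    prob = 1.0
    for d in sq_dists:
        p = min(1.0, oversample * d / total)
        prob *= 1.0 - p  # point i is NOT selected
    return prob


# 100 equally distant points, k = 5, oversampling factor 2*k = 10:
# the expected number of new centers is 10, yet an empty round is possible.
p_empty = prob_round_adds_nothing([1.0] * 100, oversample=10)
```

Under these assumptions the probability of an empty round is small but 
strictly positive, so a test that requires at least one new center per round 
can fail without a fixed seed.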

Choices are:

1) Change the k-means|| implementation to keep selecting points until it has 
satisfied a lower bound on the number of points chosen.

2) Eliminate the test.

3) Ignore the problem and depend on the random number generator to sample the 
space in a lucky manner.

Option (1) is most in keeping with the contract that k-means|| should provide a 
precise number of cluster centers when possible.
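
One way option (1) might look is sketched below: a sampling round is redrawn 
until it yields at least a minimum number of points. This is an illustration 
only, not the MLlib implementation; the names sample_round, 
sample_round_with_lower_bound, min_new and max_tries are hypothetical, and the 
per-point probability is the same assumed rule min(1, l * d^2 / total cost).

```python
import random


def sample_round(sq_dists, oversample, rng):
    """One round: pick each index independently with prob. min(1, l*d/total)."""
    total = sum(sq_dists)
    if total == 0:
        return []
    return [i for i, d in enumerate(sq_dists)
            if rng.random() < min(1.0, oversample * d / total)]


def sample_round_with_lower_bound(sq_dists, oversample, min_new, rng,
                                  max_tries=100):
    """Redraw the round until at least min_new points are chosen.

    The target is capped by the number of points with non-zero distance,
    since those are the only ones that can ever be selected. If max_tries
    is exhausted, the last draw is returned as-is.
    """
    candidates = sum(1 for d in sq_dists if d > 0)
    target = min(min_new, candidates)
    chosen = []
    for _ in range(max_tries):
        chosen = sample_round(sq_dists, oversample, rng)
        if len(chosen) >= target:
            break
    return chosen


rng = random.Random(0)
new_points = sample_round_with_lower_bound([1.0] * 20, oversample=0.5,
                                           min_new=1, rng=rng)
```

Redrawing a whole round conditions the sample on being large enough, which 
changes the sampling distribution slightly; whether that is acceptable is a 
design decision for the implementation.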



