[ https://issues.apache.org/jira/browse/SPARK-6068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341299#comment-14341299 ]
Sean Owen commented on SPARK-6068:
----------------------------------

Has the test failed or is this theoretical? Fixing the implementation to guarantee this contract is ideal, if there's no real downside.

Something that fails once in a blue, blue moon due to random state isn't inherently a problem, so I would not delete the test over it, no. The alternative is usually to always test the same set of random states, with a fixed seed (where that is even possible), which isn't great either. Regular failure would make it a useless test, though. Hopefully a moot point.

Derrick, what PR are you having trouble with -- the big-bang multi-JIRA PR that's been going on for ages? Targeted, bite-size fixes to existing code here are much easier to get in. I hope you'll offer some changes for some of the many other JIRAs you've opened here. A lot look useful.

> KMeans Parallel test may fail
> -----------------------------
>
>                 Key: SPARK-6068
>                 URL: https://issues.apache.org/jira/browse/SPARK-6068
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.2.1
>            Reporter: Derrick Burns
>              Labels: clustering
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The test "k-means|| initialization" in KMeansSuite can fail when the random
> number generator is truly random.
> The test is predicated on the assumption that each round of K-Means || will
> add at least one new cluster center. The current implementation of K-Means
> || adds 2*k cluster centers with high probability. However, there is no
> deterministic lower bound on the number of cluster centers added.
> Choices are:
> 1) change the KMeans || implementation to iterate on selecting points until
> it has satisfied a lower bound on the number of points chosen.
> 2) eliminate the test
> 3) ignore the problem and depend on the random number generator to sample the
> space in a lucky manner.
> Option (1) is most in keeping with the contract that KMeans || should provide
> a precise number of cluster centers when possible.
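
For illustration only, option (1) from the description might look roughly like the Python sketch below: one k-means||-style sampling round that redraws until a lower bound on the number of chosen points is met. This is not the MLlib implementation. The names `sample_round`, `min_new`, and `max_tries` are hypothetical, the 2*k oversampling factor is taken from the description above, and the fixed seed only shows how a test could be made repeatable.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def sample_round(points, centers, k, rng, min_new=1, max_tries=100):
    """One oversampling round that redraws until min_new points are chosen.

    Each point is kept with probability min(1, 2*k*cost/total), where cost
    is its squared distance to the nearest current center. (Hypothetical
    helper, not Spark's code.)
    """
    costs = [min(sq_dist(p, c) for c in centers) for p in points]
    total = sum(costs)
    if total == 0.0:
        return []  # every point already coincides with a center
    # Points with zero cost can never be drawn, so cap the bound accordingly.
    eligible = sum(1 for c in costs if c > 0.0)
    target = min(min_new, eligible)
    for _ in range(max_tries):
        chosen = [
            p for p, c in zip(points, costs)
            if rng.random() < min(1.0, 2.0 * k * c / total)
        ]
        if len(chosen) >= target:
            return chosen
    raise RuntimeError("no draw met the lower bound within max_tries")


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
rng = random.Random(42)  # fixed seed, so a test built on this is repeatable
new_centers = sample_round(points, centers=[points[0]], k=2, rng=rng)
print(new_centers)
```

Redrawing conditions the sample on meeting the bound, so it shifts the sampling distribution slightly; whether that counts as a "real downside" is the judgment call raised in the comment above.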