[
https://issues.apache.org/jira/browse/SPARK-6068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341128#comment-14341128
]
Joseph K. Bradley commented on SPARK-6068:
------------------------------------------
Yes, unit tests should not be flaky.
True, fixed seeds are a bit of a hack, but they have worked pretty well so far.
It would be great if you fixed the implementation to prevent low-likelihood
failures.
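
For illustration, here is a minimal sketch of why a fixed seed keeps a
randomized test reproducible. It is not the KMeansSuite code: the function
name sample_new_centers, the example costs, and the sampling probability
min(1, 2*k*cost/total) are assumptions made for this example.

```python
import random

def sample_new_centers(costs, k, rng):
    """Hypothetical oversampling round: keep point i with
    probability min(1, 2*k*cost_i/total)."""
    total = sum(costs)
    return [i for i, c in enumerate(costs)
            if rng.random() < min(1.0, 2.0 * k * c / total)]

# Made-up per-point costs (distance to the nearest current center).
costs = [0.5, 1.0, 2.0, 0.25, 4.0, 1.5]

# Two generators built from the same seed produce the same draws,
# so the test sees the same selection on every run.
first = sample_new_centers(costs, k=2, rng=random.Random(42))
second = sample_new_centers(costs, k=2, rng=random.Random(42))
assert first == second
```

The seed hides the low-probability failure rather than removing it, which
is why a fix in the implementation is the more robust option.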
> KMeans Parallel test may fail
> -----------------------------
>
> Key: SPARK-6068
> URL: https://issues.apache.org/jira/browse/SPARK-6068
> Project: Spark
> Issue Type: Bug
> Components: MLlib
> Affects Versions: 1.2.1
> Reporter: Derrick Burns
> Labels: clustering
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> The test "k-means|| initialization" in KMeansSuite can fail when the random
> number generator is truly random.
> The test is predicated on the assumption that each round of K-Means || will
> add at least one new cluster center. The current implementation of K-Means
> || adds about 2*k cluster centers per round in expectation. However, there
> is no deterministic lower bound on the number of cluster centers added.
> Choices are:
> 1) change the KMeans || implementation to iterate on selecting points until
> it has satisfied a lower bound on the number of points chosen.
> 2) eliminate the test
> 3) ignore the problem and depend on the random number generator to sample the
> space in a lucky manner.
> Option (1) is most in keeping with the contract that KMeans || should provide
> a precise number of cluster centers when possible.
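
A rough sketch of what option (1) could look like, written in Python rather
than as a patch to MLlib. Everything here is hypothetical: the names
sample_round, min_new, and max_attempts, and the per-point probability
min(1, 2*k*cost/total), are assumptions for illustration and are not taken
from the Spark source.

```python
import random

def sample_round(costs, k, rng, min_new=1, max_attempts=100):
    """One oversampling round that retries until at least `min_new`
    points are chosen (hypothetical sketch of option 1)."""
    total = sum(costs)
    if total <= 0.0:
        # Every point already coincides with a center; nothing to add.
        return []
    # Only points with positive cost can ever be selected.
    candidates = [i for i, c in enumerate(costs) if c > 0.0]
    needed = min(min_new, len(candidates))
    for _ in range(max_attempts):
        chosen = [i for i in candidates
                  if rng.random() < min(1.0, 2.0 * k * costs[i] / total)]
        if len(chosen) >= needed:
            return chosen
    # Deterministic fallback: take the highest-cost points so the
    # lower bound holds even if every random attempt falls short.
    return sorted(candidates, key=lambda i: costs[i], reverse=True)[:needed]

# With 100 equal-cost points and k=1, a single unguarded draw can
# come up empty; the guarded round always returns at least one point.
results = [sample_round([1.0] * 100, 1, random.Random(seed))
           for seed in range(50)]
assert all(len(r) >= 1 for r in results)
```

The fallback makes the lower bound hold regardless of the generator, so a
test that expects at least one new center per round would no longer depend
on the seed. Whether retrying or a deterministic top-up is the better fit
for the real implementation is a design choice this sketch does not settle.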