[ 
https://issues.apache.org/jira/browse/SPARK-6068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341299#comment-14341299
 ] 

Sean Owen commented on SPARK-6068:
----------------------------------

Has the test failed or is this theoretical?
Fixing the implementation to guarantee this contract is ideal, if there's no 
real downside. 

Something that fails once in a blue, blue moon due to random state isn't 
inherently a problem, so I would not delete the test over it, no. The 
alternative is usually to always test the same set of random states, with a 
fixed seed (where that is even possible), which isn't great either. Regular 
failure would make it a useless test, though. Hopefully a moot point.
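
To put a rough number on "once in a blue moon", here is a minimal Python sketch, not the Spark code. It assumes each point is kept independently with probability min(1, 2*k*cost/total_cost), which is my reading of the sampling step; the function names are made up for illustration. It also shows why a fixed seed only ever exercises one random state.

```python
import math
import random


def prob_empty_round(costs, k):
    """Probability that one sampling round keeps no point at all.

    Assumes independent draws with p_i = min(1, 2*k*cost_i/total_cost).
    """
    total = sum(costs)
    return math.prod(1.0 - min(1.0, 2.0 * k * c / total) for c in costs)


def count_selected(costs, k, seed):
    """Number of points kept in one round for a given (hypothetical) seed."""
    rng = random.Random(seed)
    total = sum(costs)
    return sum(rng.random() < min(1.0, 2.0 * k * c / total) for c in costs)


costs = [1.0] * 1000                     # toy data: uniform costs
p_empty = prob_empty_round(costs, k=5)   # small but strictly positive

# A fixed seed reproduces the same draw every time, so a seeded test
# checks exactly one random state and nothing else.
same = count_selected(costs, 5, seed=42) == count_selected(costs, 5, seed=42)
```

With uniform costs the empty-round probability is bounded above by exp(-2k), so it shrinks quickly as k grows but never reaches zero.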

Derrick, which PR are you having trouble with -- the big-bang multi-JIRA PR 
that's been going on for ages? Targeted, bite-size fixes to existing code 
are much easier to get in. I hope you'll offer changes for some (other) 
of the many JIRAs you've opened here. A lot look useful.

> KMeans Parallel test may fail
> -----------------------------
>
>                 Key: SPARK-6068
>                 URL: https://issues.apache.org/jira/browse/SPARK-6068
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.2.1
>            Reporter: Derrick Burns
>              Labels: clustering
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The test "k-means|| initialization" in KMeansSuite can fail when the random 
> number generator is truly random.
> The test is predicated on the assumption that each round of K-Means || will 
> add at least one new cluster center.  The current implementation of K-Means 
> || adds 2*k cluster centers with high probability.  However, there is no 
> deterministic lower bound on the number of cluster centers added.
> Choices are:
> 1)  change the KMeans || implementation to iterate on selecting points until 
> it has satisfied a lower bound on the number of points chosen.
> 2) eliminate the test
> 3) ignore the problem and depend on the random number generator to sample the 
> space in a lucky manner. 
> Option (1) is most in keeping with the contract that KMeans || should provide 
> a precise number of cluster centers when possible. 
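
Option (1) in the description above could look roughly like the following Python sketch. It is an illustration under assumptions, not the MLlib implementation: the per-point probability min(1, 2*k*cost/total_cost) is assumed, and `sample_round_with_lower_bound`, `min_new` and `max_attempts` are hypothetical names.

```python
import random


def sample_round(costs, k, rng):
    """One round: keep point i with probability min(1, 2*k*cost_i/total)."""
    total = sum(costs)
    if total == 0:
        return []  # every point already coincides with a center
    return [i for i, c in enumerate(costs)
            if rng.random() < min(1.0, 2.0 * k * c / total)]


def sample_round_with_lower_bound(costs, k, rng, min_new=1, max_attempts=100):
    """Repeat the round until at least min_new points are chosen.

    The bound is capped by the number of points with positive cost,
    since points at zero cost can never be selected.
    """
    eligible = sum(1 for c in costs if c > 0)
    target = min(min_new, eligible)
    for _ in range(max_attempts):
        chosen = sample_round(costs, k, rng)
        if len(chosen) >= target:
            return chosen
    # Deterministic fallback: take the highest-cost points.
    ranked = sorted(range(len(costs)), key=lambda i: costs[i], reverse=True)
    return ranked[:target]


chosen = sample_round_with_lower_bound([1.0] * 1000, k=1, rng=random.Random())
```

Note that retrying conditions the draw on being non-empty, so the sampling distribution shifts slightly; whether that matters is the "real downside" question raised in the comment.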



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
