[jira] [Commented] (SPARK-6068) KMeans Parallel test may fail

Joseph K. Bradley (JIRA) Sat, 28 Feb 2015 15:32:55 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-6068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341833#comment-14341833
 ]


Joseph K. Bradley commented on SPARK-6068:
------------------------------------------

[~derrickburns]  I'm sorry that it can take a long time to get a PR into 
Spark, but sending small PRs, with one PR per JIRA, helps a lot.  For a reviewer 
to say "LGTM," they need to fully understand and be prepared to "own" the code, 
which makes reviewing large patches *much* harder.  I've spent a lot of time 
breaking my own patches into smaller pieces.

Looking over your JIRAs, the changes all sound useful.  It also seems like the 
most important change for you (supporting general Bregman divergences) could 
potentially be added in spark.ml or spark.mllib without making breaking 
changes.  Since there is no distance metric parameter currently, adding one 
based on a Bregman divergence API should be possible.  However, it's pretty 
hard to figure out exactly what changes are needed because of the many issues 
being addressed in your big k-means PR.  A smaller PR would help a lot.
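
As a rough illustration of why a divergence parameter need not be a breaking 
change, here is a minimal Python sketch (not Spark code).  The names 
BregmanDivergence, squared_euclidean, generalized_kl and closest_center are 
hypothetical and exist only for this example.  It assumes the usual definition 
D_F(x, y) = F(x) - F(y) - <grad F(y), x - y> for a convex function F, and it 
keeps squared Euclidean distance as the default so existing callers would see 
no change in behavior.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


class BregmanDivergence:
    """Hypothetical interface: D_F(x, y) = F(x) - F(y) - <grad F(y), x - y>."""

    def __init__(self,
                 f: Callable[[Vector], float],
                 grad_f: Callable[[Vector], List[float]]) -> None:
        self.f = f            # convex generator function F
        self.grad_f = grad_f  # gradient of F

    def __call__(self, x: Vector, y: Vector) -> float:
        grad_y = self.grad_f(y)
        inner = sum(g * (xi - yi) for g, xi, yi in zip(grad_y, x, y))
        return self.f(x) - self.f(y) - inner


# F(v) = ||v||^2 yields the squared Euclidean distance.
squared_euclidean = BregmanDivergence(
    f=lambda v: sum(vi * vi for vi in v),
    grad_f=lambda v: [2.0 * vi for vi in v],
)

# F(v) = sum v_i log v_i yields the generalized KL divergence
# (this sketch assumes strictly positive entries).
generalized_kl = BregmanDivergence(
    f=lambda v: sum(vi * math.log(vi) for vi in v),
    grad_f=lambda v: [math.log(vi) + 1.0 for vi in v],
)


def closest_center(point: Vector,
                   centers: Sequence[Vector],
                   divergence: BregmanDivergence = squared_euclidean) -> int:
    """Return the index of the center with the smallest divergence from point.

    The default argument preserves the current squared-Euclidean behavior.
    """
    return min(range(len(centers)),
               key=lambda i: divergence(point, centers[i]))
```

A caller that never passes `divergence` gets the same assignments as before, 
while `closest_center(p, centers, divergence=generalized_kl)` opts in to a 
different divergence.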

I hope it will prove worthwhile for you to help get these improvements into 
MLlib, piece by piece.  I don't think they will all require waiting for the 
spark.ml API, but if you do want to make major API changes, then this would be 
the time to design the new API for the spark.ml package.
* [SPARK-6001] might require an API change since it would return a model which 
could not be serialized.  Perhaps it could follow a similar pattern as LDA, 
which returns a DistributedLDAModel (with info about the training dataset topic 
distributions), which in turn can be converted into a LocalLDAModel (which 
stores model parameters locally and drops the training dataset info).
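
The distributed-to-local pattern described above could look roughly like the 
following Python sketch.  It is only an illustration of the idea, not the 
Spark API: DistributedModel, LocalModel, training_info and to_local are 
hypothetical names, and a plain Python object stands in for the training 
dataset information that cannot be serialized.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, List


@dataclass
class LocalModel:
    """Hypothetical local model: holds only plain, serializable parameters."""
    centers: List[List[float]]

    def to_json(self) -> str:
        return json.dumps(asdict(self))


class DistributedModel:
    """Hypothetical model that still references training-time state."""

    def __init__(self, centers: List[List[float]], training_info: Any) -> None:
        self.centers = centers
        # Stand-in for a handle to the distributed training dataset,
        # which is assumed not to be serializable.
        self.training_info = training_info

    def to_local(self) -> LocalModel:
        # Copy the parameters and drop the training dataset info.
        return LocalModel(centers=[list(c) for c in self.centers])
```

Training would return the DistributedModel, and callers who need to save or 
ship the model would call `to_local()` first.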

> KMeans Parallel test may fail
> -----------------------------
>
>                 Key: SPARK-6068
>                 URL: https://issues.apache.org/jira/browse/SPARK-6068
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 1.2.1
>            Reporter: Derrick Burns
>              Labels: clustering
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> The test "k-means|| initialization" in KMeansSuite can fail when the random 
> number generator is truly random.
> The test is predicated on the assumption that each round of K-Means || will 
> add at least one new cluster center.  The current implementation of K-Means 
> || adds 2*k cluster centers with high probability.  However, there is no 
> deterministic lower bound on the number of cluster centers added.
> Choices are:
> 1)  change the KMeans || implementation to iterate on selecting points until 
> it has satisfied a lower bound on the number of points chosen.
> 2) eliminate the test
> 3) ignore the problem and depend on the random number generator to sample the 
> space in a lucky manner. 
> Option (1) is most in keeping with the contract that KMeans || should provide 
> a precise number of cluster centers when possible. 
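
A minimal Python sketch of option (1), under stated assumptions, is given 
here for illustration only; it is not the MLlib implementation.  The 
function names, the fixed per-point costs, the max_attempts cap and the 
highest-cost fallback are all assumptions made for this example.  Each round 
selects a point independently with probability proportional to its cost, and 
the round is repeated until a lower bound on the number of selected points is 
met.

```python
import random
from typing import List, Sequence


def sample_round(costs: Sequence[float], oversampling: float,
                 rng: random.Random) -> List[int]:
    """One sampling round: pick index i with probability
    min(1, oversampling * costs[i] / sum(costs))."""
    total = sum(costs)
    if total <= 0.0:
        return []
    return [i for i, c in enumerate(costs)
            if rng.random() < oversampling * c / total]


def sample_at_least(costs: Sequence[float], oversampling: float,
                    min_new: int, rng: random.Random,
                    max_attempts: int = 100) -> List[int]:
    """Repeat sampling rounds until at least min_new distinct points are
    chosen, or as many as have non-zero cost if fewer are available."""
    eligible = [i for i, c in enumerate(costs) if c > 0.0]
    target = min(min_new, len(eligible))
    chosen = set()
    attempts = 0
    while len(chosen) < target and attempts < max_attempts:
        chosen.update(sample_round(costs, oversampling, rng))
        attempts += 1
    if len(chosen) < target:
        # Assumed fallback so the lower bound holds deterministically:
        # fill up with the highest-cost points not yet chosen.
        remaining = sorted((i for i in eligible if i not in chosen),
                           key=lambda i: costs[i], reverse=True)
        chosen.update(remaining[:target - len(chosen)])
    return sorted(chosen)
```

In a real k-means|| round the costs would be recomputed after new centers are 
added; they are held fixed here to keep the retry loop easy to follow.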



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
