[
https://issues.apache.org/jira/browse/SPARK-6068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341833#comment-14341833
]
Joseph K. Bradley commented on SPARK-6068:
------------------------------------------
[~derrickburns] I'm sorry about how it can take a long time to get a PR into
Spark, but sending small PRs with one PR per JIRA helps a lot. For a reviewer
to say "LGTM," they need to fully understand and be prepared to "own" the code,
which makes reviewing large patches *much* harder. I've spent a lot of time
breaking my patches into smaller pieces.
Looking over your JIRAs, the changes all sound useful. It also seems like the
most important change for you (supporting general Bregman divergences) could
potentially be added in spark.ml or spark.mllib without making breaking
changes. Since there is no distance metric parameter currently, adding one
based on a Bregman divergence API should be possible. However, it's pretty
hard to figure out exactly what changes are needed because of the many issues
being addressed in your big k-means PR. A smaller PR would help a lot.
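
To make the idea of a divergence parameter concrete, here is a minimal sketch
in plain Python rather than Spark code. It assumes a Bregman divergence is
built from a convex function phi and its gradient; the names
bregman_divergence, squared_euclidean, generalized_kl, and closest_center are
hypothetical and do not correspond to any existing MLlib API.

```python
import math

def bregman_divergence(phi, grad_phi):
    """Build D(x, y) = phi(x) - phi(y) - <grad_phi(y), x - y>."""
    def divergence(x, y):
        g = grad_phi(y)
        inner = sum(gi * (xi - yi) for gi, xi, yi in zip(g, x, y))
        return phi(x) - phi(y) - inner
    return divergence

# phi(x) = ||x||^2 recovers the squared Euclidean distance.
squared_euclidean = bregman_divergence(
    lambda x: sum(xi * xi for xi in x),
    lambda x: [2.0 * xi for xi in x],
)

# phi(x) = sum_i x_i log x_i gives the generalized KL divergence
# (only defined for strictly positive entries).
generalized_kl = bregman_divergence(
    lambda x: sum(xi * math.log(xi) for xi in x),
    lambda x: [math.log(xi) + 1.0 for xi in x],
)

def closest_center(point, centers, divergence=squared_euclidean):
    """Index of the center minimizing divergence(point, center).

    The default keeps the existing Euclidean behavior, so callers that
    never pass a divergence would see no change.
    """
    return min(range(len(centers)), key=lambda i: divergence(point, centers[i]))
```

Defaulting the hypothetical divergence argument to squared Euclidean distance
is what would let such a parameter be added without a breaking change.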
I hope it will prove worthwhile for you to help get these improvements into
MLlib, piece by piece. I don't think they will all require waiting for the
spark.ml API, but if you do want to make major API changes, then this would be
the time to design the new API for the spark.ml package.
* [SPARK-6001] might require an API change since it would return a model which
could not be serialized. Perhaps it could follow a similar pattern as LDA,
which returns a DistributedLDAModel (with info about the training dataset topic
distributions), which in turn can be converted into a LocalLDAModel (which
stores model parameters locally and drops the training dataset info).
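
A rough sketch of that distributed-to-local pattern, in plain Python rather
than Spark code. The class names DistributedClusterModel and
LocalClusterModel, and the to_local method, are hypothetical; the
training_assignments field is only a stand-in for a handle to a distributed
dataset.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, List

@dataclass
class LocalClusterModel:
    """Holds only the model parameters, so it is cheap to serialize."""
    centers: List[List[float]]

    def to_json(self) -> str:
        return json.dumps(asdict(self))

@dataclass
class DistributedClusterModel:
    """Holds the parameters plus per-point training info.

    training_assignments stands in for a distributed dataset handle,
    which generally cannot be serialized along with the model.
    """
    centers: List[List[float]]
    training_assignments: Any

    def to_local(self) -> LocalClusterModel:
        # Copy the parameters and drop the training dataset info.
        return LocalClusterModel([list(c) for c in self.centers])
```

Training would return the distributed variant, and callers who need to save
or ship the model would call to_local() first.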
> KMeans Parallel test may fail
> -----------------------------
>
> Key: SPARK-6068
> URL: https://issues.apache.org/jira/browse/SPARK-6068
> Project: Spark
> Issue Type: Bug
> Components: MLlib
> Affects Versions: 1.2.1
> Reporter: Derrick Burns
> Labels: clustering
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> The test "k-means|| initialization" in KMeansSuite can fail when the random
> number generator is truly random.
> The test is predicated on the assumption that each round of K-Means || will
> add at least one new cluster center. The current implementation of K-Means
> || adds 2*k cluster centers with high probability. However, there is no
> deterministic lower bound on the number of cluster centers added.
> Choices are:
> 1) change the KMeans || implementation to iterate on selecting points until
> it has satisfied a lower bound on the number of points chosen.
> 2) eliminate the test
> 3) ignore the problem and depend on the random number generator to sample the
> space in a lucky manner.
> Option (1) is most in keeping with the contract that KMeans || should provide
> a precise number of cluster centers when possible.
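
Option (1) could look roughly like the sketch below, written in plain Python
rather than against the MLlib code. It assumes each point is sampled
independently with probability proportional to its cost (its squared distance
to the nearest existing center), retries a bounded number of times, and then
falls back to the highest-cost points. The function name sample_new_centers
and the parameters min_new and max_attempts are hypothetical.

```python
import random

def sample_new_centers(costs, oversampling, rng, min_new=1, max_attempts=10):
    """Pick point indices for one round, guaranteeing a lower bound.

    costs[i] is the cost of point i with respect to the current centers.
    Returns a sorted list with at least min(min_new, #points with cost > 0)
    indices.
    """
    total = sum(costs)
    eligible = [i for i, c in enumerate(costs) if c > 0.0]
    target = min(min_new, len(eligible))
    chosen = set()
    for _ in range(max_attempts):
        for i in eligible:
            prob = min(1.0, oversampling * costs[i] / total)
            if i not in chosen and rng.random() < prob:
                chosen.add(i)
        if len(chosen) >= target:
            break
    if len(chosen) < target:
        # Deterministic fallback: take the highest-cost points not yet chosen.
        remaining = sorted(
            (i for i in eligible if i not in chosen),
            key=lambda i: costs[i],
            reverse=True,
        )
        chosen.update(remaining[: target - len(chosen)])
    return sorted(chosen)
```

Note that retrying and the fallback both change the sampling distribution
slightly, so this trades exact fidelity to the original sampling scheme for a
deterministic lower bound on the number of centers added per round.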