GitHub user rnowling opened a pull request:
https://github.com/apache/spark/pull/1248
[SPARK-2308][MLLIB] Add Mini-Batch KMeans Clustering method
Mini-batch KMeans is a variant of KMeans that uses a randomly sampled subset
of the data points in each iteration instead of the full data set, which
reduces the cost of each iteration (and in some cases improves accuracy). The
mini-batch version is compatible with the KMeans|| initialization algorithm
currently implemented in MLlib.
This PR adds the KMeansMiniBatch clustering algorithm and its tests, and
updates the docs.
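The mini-batch idea described above can be sketched in a few lines of plain
Python. This is only an illustration under stated assumptions, not the code
from this PR: the function names (mini_batch_kmeans, nearest_center,
squared_distance) and parameters are hypothetical, the initial centers are
picked at random rather than with KMeans||, and each iteration simply runs one
ordinary KMeans assignment/update step on the sampled subset.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_center(point, centers):
    """Index of the center closest to `point` (first index wins ties)."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(point, centers[i]))


def mini_batch_kmeans(points, k, batch_size, iterations, seed=0):
    """Cluster `points`, using a fresh random subset in every iteration."""
    rng = random.Random(seed)
    # Assumption: plain random initialization, not KMeans||.
    centers = [list(p) for p in rng.sample(points, k)]
    dims = len(centers[0])
    for _ in range(iterations):
        # Sample the mini-batch (without replacement) from the full data set.
        batch = rng.sample(points, min(batch_size, len(points)))
        # Assignment step, performed on the batch only.
        sums = [[0.0] * dims for _ in range(k)]
        counts = [0] * k
        for p in batch:
            i = nearest_center(p, centers)
            counts[i] += 1
            for d, value in enumerate(p):
                sums[i][d] += value
        # Update step: move each center to the mean of its batch points.
        # A center that received no points keeps its previous position.
        for i in range(k):
            if counts[i] > 0:
                centers[i] = [s / counts[i] for s in sums[i]]
    return centers
```

The cost of one iteration here depends on batch_size rather than on the size
of the whole data set, which is where the savings come from. Setting
batch_size to len(points) makes each iteration a regular full-data KMeans
step.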
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/rnowling/spark master
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/1248.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1248
----
commit d56aa5b22829c47d7be5c6f9c3483209502c84cc
Author: RJ Nowling <[email protected]>
Date:   2014-06-27T18:31:22Z

    Added KMeansMiniBatch implementation

commit 54fabe1c7b158c64d860151ca77a410df66a6ac7
Author: RJ Nowling <[email protected]>
Date:   2014-06-27T18:36:47Z

    Updated KMeansMiniBatch docs

commit 2afee1af31aeb4a542ff24628f4ed89d46e3a06f
Author: RJ Nowling <[email protected]>
Date:   2014-06-27T18:49:05Z

    Added KMeansMiniBatch to docs

commit 0853adbf55a7452e1804d722b133b002d5c0ff19
Author: RJ Nowling <[email protected]>
Date:   2014-06-27T19:49:11Z

    Added overloaded alternative for train()

commit fc472ca867fbe2475cbd402f32a78c1e5cb3f060
Author: RJ Nowling <[email protected]>
Date:   2014-06-27T19:49:43Z

    Added KMeansMiniBatchSuite test

----