GitHub user jkbradley opened a pull request:
https://github.com/apache/spark/pull/1407
SPARK-1215: Clustering: Index out of bounds error
Bug fix for JIRA SPARK-1215: Clustering: Index out of bounds error
https://issues.apache.org/jira/browse/SPARK-1215
Solution: When the data has fewer distinct points than clusters k, print a
warning and use duplicate cluster centers so that exactly k centers are
returned.
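For readers who want to see the idea in isolation, here is a minimal sketch of
a k-means++ style initialization that falls back to duplicate centers. It is
written in Python for brevity and is not the code in LocalKMeans.scala; the
names kmeans_pp_init and sq_dist, the seed parameter, and the choice to
duplicate the first center are all assumptions made for illustration.

```python
import random
import warnings


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, seed=0):
    """Return exactly k initial centers (hypothetical helper, not Spark's API).

    If the data runs out of distinct points before k centers are chosen,
    a warning is issued and an existing center is duplicated.
    """
    if not points or k <= 0:
        raise ValueError("need at least one point and k >= 1")
    rng = random.Random(seed)

    # First center: a uniformly random data point.
    centers = [points[rng.randrange(len(points))]]
    # Cost of a point = squared distance to its nearest chosen center.
    costs = [sq_dist(p, centers[0]) for p in points]

    for i in range(1, k):
        if not any(c > 0.0 for c in costs):
            # Every point coincides with some center: no distinct points left.
            warnings.warn(
                "ran out of distinct points; using duplicate center for k = %d" % i
            )
            centers.append(centers[0])  # assumption: duplicate the first center
            continue

        # Sample the next center with probability proportional to its cost.
        r = rng.random() * sum(costs)
        cum = 0.0
        chosen = None
        for idx, c in enumerate(costs):
            cum += c
            if c > 0.0 and cum > r:
                chosen = idx
                break
        if chosen is None:
            # Guard against floating-point rounding: take the costliest point.
            chosen = max(range(len(points)), key=costs.__getitem__)

        new_center = points[chosen]
        centers.append(new_center)
        costs = [min(c, sq_dist(p, new_center)) for p, c in zip(points, costs)]

    return centers
```

With two distinct points and k = 4, this sketch returns four centers drawn
from those two points instead of indexing past the available candidates.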
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/jkbradley/spark master
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/1407.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1407
----
commit 97f2104bac2ab864c2a03f9a12a4b936557ae6d6
Author: Joseph Bradley <[email protected]>
Date: 2014-05-20T01:35:53Z
added RDD::stratifiedSample method and associated unit tests in RDDSuite.
Method is built off of RDD::takeSample method.
commit 91e83338820158b96cda492668dbed5fff33f19b
Author: Joseph Bradley <[email protected]>
Date: 2014-05-20T01:36:30Z
added RDD::stratifiedSample method documentation
commit d6f8913b7e370a82138b9c623754b32a59c21cf6
Author: Joseph Bradley <[email protected]>
Date: 2014-05-21T04:58:12Z
updated stratifiedSample to be more scalable, keeping data in RDDs instead
of collecting to the driver
commit 21eead6a412508b536358f4e557e2fab23c9c696
Author: Joseph Bradley <[email protected]>
Date: 2014-05-23T19:51:07Z
updated stratifiedSample to use selection-rejection to select samples on
each partition in 1 pass, rather than pre-selecting indices
commit 91f4b19702bc58a77d28316674eace881a81165f
Author: Joseph K. Bradley <[email protected]>
Date: 2014-07-09T21:37:25Z
merging with new spark
commit c0cb5f0d8c6104e3eb6cfa44820ba00b81bc7262
Author: Joseph K. Bradley <[email protected]>
Date: 2014-07-11T18:43:17Z
merging with updated spark
commit 7d1b812a720cffdefe78ddb6e641930e7ae4975b
Author: Joseph K. Bradley <[email protected]>
Date: 2014-07-11T18:47:41Z
removed my coding test updates
commit 18e5c8ad740871be92c6d7b73f5d35e25641a734
Author: Joseph K. Bradley <[email protected]>
Date: 2014-07-12T01:12:44Z
Added check to LocalKMeans.scala: kMeansPlusPlus initialization to handle
case with fewer distinct data points than clusters k. Added two related unit
tests to KMeansSuite.
commit e2bf638c6b3e8cc9cec3362caddb2305109d4c0a
Author: Joseph K. Bradley <[email protected]>
Date: 2014-07-14T17:52:33Z
Merge remote-tracking branch 'upstream/master'
----