GitHub user dorx opened a pull request:
https://github.com/apache/spark/pull/1025
[SPARK-2082] stratified sampling in PairRDDFunctions that guarantees exact
sample size
Implemented stratified sampling that guarantees exact sample size using
ScaSRS, with two passes over the RDD for sampling without replacement and three
passes for sampling with replacement.
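To illustrate what an exact per-stratum sample size means, here is a minimal
in-memory Python sketch for the without-replacement case. It is not the
algorithm used in this pull request and not Spark's API: the function name
stratified_sample_exact, the fractions mapping, and the rule that each key k
receives ceil(fraction_k * count_k) items are all assumptions made for
illustration. The first loop counts each stratum and the second draws from it,
which loosely mirrors the two-pass structure described above.

```python
import math
import random
from collections import defaultdict


def stratified_sample_exact(pairs, fractions, seed=None):
    """Sample (key, value) pairs without replacement.

    Hypothetical helper: returns exactly ceil(fractions[k] * n_k) pairs
    for every key k, where n_k is the number of pairs with that key.
    """
    rng = random.Random(seed)

    # Pass 1: group values by key so each stratum's size n_k is known.
    strata = defaultdict(list)
    for key, value in pairs:
        strata[key].append(value)

    # Pass 2: draw the exact target count from each stratum.
    sample = []
    for key, values in strata.items():
        # Assumed rounding rule; capped so it never exceeds the stratum size.
        target = min(len(values), math.ceil(fractions[key] * len(values)))
        sample.extend((key, v) for v in rng.sample(values, target))
    return sample


# Usage: 10 values under "a", 4 under "b"; expect exactly 5 and 1 picks.
data = [("a", i) for i in range(10)] + [("b", i) for i in range(4)]
picked = stratified_sample_exact(data, {"a": 0.5, "b": 0.25}, seed=42)
```

Unlike plain Bernoulli sampling, where each stratum's sample size only matches
the requested fraction in expectation, this sketch always returns the target
count per key. It does so by holding each stratum in memory, which a
distributed implementation cannot assume.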
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dorx/spark stratified
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/1025.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1025
----
commit 14419775202e6eef1f0e1f0c74c7be9030aca73d
Author: Doris Xin <[email protected]>
Date: 2014-05-29T22:22:14Z
SPARK-1939 Refactor takeSample method in RDD to use ScaSRS
commit ffea61a67d228edb476d29ca13a84bb3f9a22887
Author: Doris Xin <[email protected]>
Date: 2014-05-30T00:55:54Z
SPARK-1939: Refactor takeSample method in RDD
Reviewer comments addressed:
- commons-math3 is now a test-only dependency. bumped up to v3.3
- comments added to explain what computeFraction is doing
- fixed the unit test for computeFraction to use BinomialDistribution for
without-replacement sampling
- stylistic fixes
commit 7cab53a3926f4351432e5e3600b0796b9a4146e4
Author: Doris Xin <[email protected]>
Date: 2014-06-02T19:00:38Z
fixed import bug in rdd.py
commit e3fd6a628317d559a08a7a20421e9c0618180902
Author: Doris Xin <[email protected]>
Date: 2014-06-02T19:06:18Z
Merge branch 'master' into takeSample
commit 9ee94ee3c28e8d808063fef4e5d39f06ab738e0b
Author: Doris Xin <[email protected]>
Date: 2014-06-09T20:15:23Z
[SPARK-2082] stratified sampling in PairRDDFunctions that guarantees exact
sample size
----