GitHub user dorx opened a pull request:
https://github.com/apache/spark/pull/916
SPARK-1939 Refactor takeSample method in RDD to use ScaSRS
Modified the takeSample method in RDD to use the ScaSRS sampling technique
to improve performance. Added a private method that computes a sampling rate
greater than sample_size/total, so that the sample is large enough with a
success rate >= 0.9999. Added a unit test for the private method to validate
the choice of sampling rate.
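For illustration only, here is a minimal Python sketch of how such an oversampled rate could be derived. It is not the code in this pull request: the function name compute_sampling_rate, its parameters, and the use of a Chernoff bound with delta = 1e-4 (matching the stated 0.9999 success rate) are assumptions made for this example. If each of the total items is kept independently with rate q, the lower-tail Chernoff bound gives P(kept < sample_size) <= delta whenever q >= p + gamma + sqrt(gamma^2 + 2*gamma*p), where p = sample_size/total and gamma = -ln(delta)/total.

```python
import math


def compute_sampling_rate(sample_size: int, total: int, delta: float = 1e-4) -> float:
    """Hypothetical helper: Bernoulli sampling rate q > sample_size / total.

    Keeping each of `total` items independently with rate q yields at least
    `sample_size` items with probability >= 1 - delta (Chernoff lower-tail bound).
    """
    if total <= 0 or sample_size < 0:
        raise ValueError("total must be positive and sample_size non-negative")
    if sample_size >= total:
        # Nothing to oversample: every item has to be kept.
        return 1.0
    p = sample_size / total            # naive rate, too small about half the time
    gamma = -math.log(delta) / total   # slack term coming from the failure bound
    q = p + gamma + math.sqrt(gamma * gamma + 2.0 * gamma * p)
    return min(1.0, q)


if __name__ == "__main__":
    # Example: 100 items requested out of 10,000.
    print(compute_sampling_rate(100, 10_000))
```

The rate only needs to exceed sample_size/total by a small margin when total is large, so the oversampled set stays close to the requested size and can be trimmed to exactly sample_size afterwards.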
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dorx/spark takeSample
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/916.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #916
----
commit 14419775202e6eef1f0e1f0c74c7be9030aca73d
Author: Doris Xin <[email protected]>
Date: 2014-05-29T22:22:14Z
SPARK-1939 Refactor takeSample method in RDD to use ScaSRS
----