GitHub user dorx reopened a pull request:
https://github.com/apache/spark/pull/1025
[SPARK-2082] stratified sampling in PairRDDFunctions that guarantees exact
sample size
Implemented stratified sampling that guarantees exact sample size using
ScaSRS, with two passes over the RDD for sampling without replacement and three
passes for sampling with replacement.
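For readers unfamiliar with the approach, here is a minimal in-memory sketch of
exact-size stratified sampling without replacement, written in plain Python
rather than against the Spark API. It is not the code in this pull request. The
names sample_by_key_exact, sample_exact, and thresholds are hypothetical, the
per-key target of ceil(fraction * count) is an assumption, and the threshold
constants are illustrative only. The idea it tries to show is the ScaSRS-style
accept/waitlist split: give every item a uniform random key, accept items whose
key is clearly small, discard those whose key is clearly large, and sort only
the small waitlist in between.

```python
import math
import random
from collections import defaultdict


def thresholds(n, s, delta=1e-4):
    """Accept/waitlist thresholds around the target fraction p = s / n.

    The constants are an assumption for illustration. The exact sample
    size does not depend on them because sample_exact has a fallback.
    """
    p = s / n
    g_low = -math.log(delta) / n * (2.0 / 3.0)
    g_high = -math.log(delta) / n
    low = max(0.0, p + g_low - math.sqrt(g_low * g_low + 3 * g_low * p))
    high = min(1.0, p + g_high + math.sqrt(g_high * g_high + 2 * g_high * p))
    return low, high


def sample_exact(items, s, rng):
    """Return exactly min(s, len(items)) items, sampled without replacement."""
    n = len(items)
    if s >= n:
        return list(items)
    low, high = thresholds(n, s)
    keyed = [(rng.random(), x) for x in items]
    accepted, waitlist = [], []
    for u, x in keyed:
        if u < low:
            accepted.append(x)        # key is small enough to accept outright
        elif u < high:
            waitlist.append((u, x))   # undecided until the waitlist is sorted
    need = s - len(accepted)
    if need < 0 or need > len(waitlist):
        # Rare case: the thresholds missed, so sort everything instead.
        keyed.sort(key=lambda t: t[0])
        return [x for _, x in keyed[:s]]
    waitlist.sort(key=lambda t: t[0])
    return accepted + [x for _, x in waitlist[:need]]


def sample_by_key_exact(pairs, fraction_by_key, seed=0):
    """Sample each key's values at its own fraction, with exact counts."""
    rng = random.Random(seed)
    # Pass 1: group values by key, which also gives the per-key counts.
    strata = defaultdict(list)
    for k, v in pairs:
        strata[k].append(v)
    # Pass 2: draw exactly ceil(fraction * count) values from each stratum.
    out = []
    for k, values in strata.items():
        s = math.ceil(fraction_by_key[k] * len(values))
        out.extend((k, v) for v in sample_exact(values, s, rng))
    return out


if __name__ == "__main__":
    data = [("a", i) for i in range(100)] + [("b", i) for i in range(40)]
    picked = sample_by_key_exact(data, {"a": 0.5, "b": 0.25}, seed=42)
    print(sum(1 for k, _ in picked if k == "a"),
          sum(1 for k, _ in picked if k == "b"))
```

In both branches the result is the s items with the smallest random keys, so
the sample size is exact regardless of how tight the thresholds are. The
thresholds only determine how many items need to be sorted.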
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/dorx/spark stratified
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/1025.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1025
----
commit 14419775202e6eef1f0e1f0c74c7be9030aca73d
Author: Doris Xin <[email protected]>
Date: 2014-05-29T22:22:14Z
SPARK-1939 Refactor takeSample method in RDD to use ScaSRS
commit ffea61a67d228edb476d29ca13a84bb3f9a22887
Author: Doris Xin <[email protected]>
Date: 2014-05-30T00:55:54Z
SPARK-1939: Refactor takeSample method in RDD
Reviewer comments addressed:
- commons-math3 is now a test-only dependency. bumped up to v3.3
- comments added to explain what computeFraction is doing
- fixed the unit for computeFraction to use BinomialDistribution for without
replacement sampling
- stylistic fixes
commit 7cab53a3926f4351432e5e3600b0796b9a4146e4
Author: Doris Xin <[email protected]>
Date: 2014-06-02T19:00:38Z
fixed import bug in rdd.py
commit e3fd6a628317d559a08a7a20421e9c0618180902
Author: Doris Xin <[email protected]>
Date: 2014-06-02T19:06:18Z
Merge branch 'master' into takeSample
commit 9ee94ee3c28e8d808063fef4e5d39f06ab738e0b
Author: Doris Xin <[email protected]>
Date: 2014-06-09T20:15:23Z
[SPARK-2082] stratified sampling in PairRDDFunctions that guarantees exact
sample size
commit 1d413ce877a67379a0a74afefba071c018b0ca70
Author: Doris Xin <[email protected]>
Date: 2014-06-09T21:14:26Z
fixed checkstyle issues
commit 7e1a48182ebec54cd3a6a290b1dc27b928f57dba
Author: Doris Xin <[email protected]>
Date: 2014-06-09T21:29:30Z
changed the permission on SamplingUtil
commit 46f6c8c86f3e1fdaf49b63796f1cd5bd6db79ec7
Author: Doris Xin <[email protected]>
Date: 2014-06-10T00:39:50Z
fixed the NPE caused by closures being cleaned before being passed into the
aggregate function
commit 50581fc8b08bd5f18cdf2288772c22f2549af0a5
Author: Doris Xin <[email protected]>
Date: 2014-06-12T21:36:37Z
added a TODO for logging in python
commit 73276111a2c4a9354bfbf6414afceabf70fb21e5
Author: Doris Xin <[email protected]>
Date: 2014-06-13T00:28:14Z
merge master
commit 9e74ab505e5441eedfed5dbfbeac37566d3de1f0
Author: Doris Xin <[email protected]>
Date: 2014-06-17T22:24:22Z
Separated out most of the logic in sampleByKey
into StratifiedSampler in util.random
commit 90d94c0d4f4d909fb99b8e72f9a09ca5329e070c
Author: Doris Xin <[email protected]>
Date: 2014-06-17T22:57:11Z
merge master
commit 0214a7659c62e4ff0f68f6e09cd7846547cd3bcb
Author: Doris Xin <[email protected]>
Date: 2014-06-18T02:22:32Z
cleanUp
Addressed reviewer comments and added better documentation of code.
Added commons-math3 as a dependency of spark (okay'ed by Matei). "mvn
clean install" compiled. Recovered files that were reverted by accident
in the merge.
TODOs: figure out API for sampleByKeyExact and update Java, Python, and
the markdown file accordingly.
commit 944a10cff3c218bafcb8b43e7e3f309cc644633e
Author: Doris Xin <[email protected]>
Date: 2014-06-19T02:48:22Z
[SPARK-2145] Add lower bound on sampling rate
to guarantee sampling performance
commit 1fe1cff58d99f336406f67f89f53fe6ab3bfde5e
Author: Doris Xin <[email protected]>
Date: 2014-06-19T19:46:59Z
Changed fractionByKey to a map to enable arg check
commit bd9dc6e08444ec83c5d6adc4c452b49d1ac2b154
Author: Doris Xin <[email protected]>
Date: 2014-06-19T20:00:47Z
unit bug and style violation fixed
commit 4ad516b14559f06a32d65ef0ce2fa2d7526610bc
Author: Doris Xin <[email protected]>
Date: 2014-06-20T18:38:09Z
remove unused imports from PairRDDFunctions
commit 254e03c96e1f2aaa5baa9c3d384adeb117e0b7ab
Author: Doris Xin <[email protected]>
Date: 2014-07-03T20:49:46Z
minor fixes and Java API.
punting on python for now. moved aggregateWithContext out of RDD
commit 6b5b10b71669590f862ec2b7d004ba976d59676c
Author: Doris Xin <[email protected]>
Date: 2014-07-03T20:58:29Z
Merge branch 'master' into stratified
commit ee9d260e5eb6ee2d6912e57491daeded1704248c
Author: Doris Xin <[email protected]>
Date: 2014-07-06T23:20:39Z
addressed reviewer comments
commit bbfb8c91a68deae076db7d9fdc518b6cee8a99f1
Author: Doris Xin <[email protected]>
Date: 2014-07-06T23:21:20Z
Merge branch 'master' into stratified
commit 9884a9f03b18f3170d90d168c246d78fc463028e
Author: Doris Xin <[email protected]>
Date: 2014-07-08T18:58:39Z
style fix
commit 680b677bc5276e1499c59c7e24abfae7d85e5c7d
Author: Doris Xin <[email protected]>
Date: 2014-07-08T23:49:25Z
use mapPartitionWithIndex instead
also better readability and lots more comments.
commit a2bf756454b2ae6b71ddc67457c0f69f165b937d
Author: Doris Xin <[email protected]>
Date: 2014-07-08T23:50:17Z
Merge branch 'master' into stratified
commit a10e68dd9c23b0fc1f25effd9ccd0ac3e7299206
Author: Doris Xin <[email protected]>
Date: 2014-07-09T01:09:14Z
style fix
commit f4c21f324075bc5e7b8cba07a5c47d23fade542f
Author: Doris Xin <[email protected]>
Date: 2014-07-15T01:34:10Z
Reviewer comments
Added BernoulliBounds
commit eecee5fb31aa678db872eccbd7385461ea3a6a96
Author: Doris Xin <[email protected]>
Date: 2014-07-15T02:56:58Z
Merge branch 'master' into stratified
Conflicts:
project/SparkBuild.scala
commit 1f9e2654921194baba29d58b862f21bd53c29470
Author: Xiangrui Meng <[email protected]>
Date: 2014-07-24T04:15:05Z
merge master
commit bd6652d8e69c18dc8ee60ae00272fb99958bfae2
Author: Xiangrui Meng <[email protected]>
Date: 2014-07-24T04:15:54Z
change commons-math3 scope to test
commit 3f6f0b784a0456f98619e40dd97a04352efd2192
Author: Xiangrui Meng <[email protected]>
Date: 2014-07-24T12:41:58Z
refactor stratified sampling impl
----