GitHub user dorx reopened a pull request:

    https://github.com/apache/spark/pull/1025

    [SPARK-2082] stratified sampling in PairRDDFunctions that guarantees exact 
sample size

    Implemented stratified sampling that guarantees exact sample size using 
ScaSRS with two passes over the RDD for sampling without replacement and three 
passes for sampling with replacement.
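
As a rough illustration of the two-pass idea for sampling without replacement, here is a minimal single-machine sketch in plain Python. It is not the code in this pull request and it does not use ScaSRS thresholds: it simply counts each stratum in a first pass, then keeps the items with the smallest random tags in a second pass. The function name `sample_by_key_exact`, the `fractions` argument, and the `ceil(fraction * count)` target size are assumptions made for this example.

```python
import math
import random
from collections import Counter, defaultdict


def sample_by_key_exact(pairs, fractions, seed=None):
    """Hypothetical helper: exact-size stratified sample without replacement.

    pairs     -- list of (key, value) tuples; it is iterated twice
    fractions -- dict mapping each key to its sampling fraction in [0, 1]
    """
    rng = random.Random(seed)

    # Pass 1: count the items in each stratum (key).
    counts = Counter(key for key, _ in pairs)

    # Assumed target size per stratum: ceil(fraction * count).
    targets = {k: math.ceil(fractions[k] * n) for k, n in counts.items()}

    # Pass 2: tag every item with a uniform random number.
    tagged = defaultdict(list)
    for key, value in pairs:
        tagged[key].append((rng.random(), value))

    # Keep the items with the smallest tags, exactly `target` per stratum.
    sample = []
    for key, items in tagged.items():
        items.sort(key=lambda t: t[0])
        sample.extend((key, value) for _, value in items[:targets[key]])
    return sample


if __name__ == "__main__":
    data = [("a", i) for i in range(10)] + [("b", i) for i in range(4)]
    picked = sample_by_key_exact(data, {"a": 0.5, "b": 0.5}, seed=7)
    print(Counter(key for key, _ in picked))
```

Sorting every stratum is the simplest way to get an exact size, but it keeps all tagged items in memory. A threshold-based scheme such as ScaSRS is intended to avoid that by accepting or rejecting most items immediately and sorting only a small waitlist.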

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/dorx/spark stratified

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1025.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1025
    
----
commit 14419775202e6eef1f0e1f0c74c7be9030aca73d
Author: Doris Xin <[email protected]>
Date:   2014-05-29T22:22:14Z

    SPARK-1939 Refactor takeSample method in RDD to use ScaSRS

commit ffea61a67d228edb476d29ca13a84bb3f9a22887
Author: Doris Xin <[email protected]>
Date:   2014-05-30T00:55:54Z

    SPARK-1939: Refactor takeSample method in RDD
    
    Reviewer comments addressed:
    - commons-math3 is now a test-only dependency. bumped up to v3.3
    - comments added to explain what computeFraction is doing
    - fixed the unit test for computeFraction to use BinomialDistribution for
    without-replacement sampling
    - stylistic fixes

commit 7cab53a3926f4351432e5e3600b0796b9a4146e4
Author: Doris Xin <[email protected]>
Date:   2014-06-02T19:00:38Z

    fixed import bug in rdd.py

commit e3fd6a628317d559a08a7a20421e9c0618180902
Author: Doris Xin <[email protected]>
Date:   2014-06-02T19:06:18Z

    Merge branch 'master' into takeSample

commit 9ee94ee3c28e8d808063fef4e5d39f06ab738e0b
Author: Doris Xin <[email protected]>
Date:   2014-06-09T20:15:23Z

    [SPARK-2082] stratified sampling in PairRDDFunctions that guarantees exact 
sample size

commit 1d413ce877a67379a0a74afefba071c018b0ca70
Author: Doris Xin <[email protected]>
Date:   2014-06-09T21:14:26Z

    fixed checkstyle issues

commit 7e1a48182ebec54cd3a6a290b1dc27b928f57dba
Author: Doris Xin <[email protected]>
Date:   2014-06-09T21:29:30Z

    changed the permission on SamplingUtil

commit 46f6c8c86f3e1fdaf49b63796f1cd5bd6db79ec7
Author: Doris Xin <[email protected]>
Date:   2014-06-10T00:39:50Z

    fixed the NPE caused by closures being cleaned before being passed into the 
aggregate function

commit 50581fc8b08bd5f18cdf2288772c22f2549af0a5
Author: Doris Xin <[email protected]>
Date:   2014-06-12T21:36:37Z

    added a TODO for logging in python

commit 73276111a2c4a9354bfbf6414afceabf70fb21e5
Author: Doris Xin <[email protected]>
Date:   2014-06-13T00:28:14Z

    merge master

commit 9e74ab505e5441eedfed5dbfbeac37566d3de1f0
Author: Doris Xin <[email protected]>
Date:   2014-06-17T22:24:22Z

    Separated out most of the logic in sampleByKey
    
    into StratifiedSampler in util.random

commit 90d94c0d4f4d909fb99b8e72f9a09ca5329e070c
Author: Doris Xin <[email protected]>
Date:   2014-06-17T22:57:11Z

    merge master

commit 0214a7659c62e4ff0f68f6e09cd7846547cd3bcb
Author: Doris Xin <[email protected]>
Date:   2014-06-18T02:22:32Z

    cleanUp
    
    Addressed reviewer comments and added better documentation of code.
    Added commons-math3 as a dependency of spark (okay’ed by Matei). “mvn
    clean install” compiled. Recovered files that were reverted by accident
    in the merge.
    TODOs: figure out API for sampleByKeyExact and update Java, Python, and
    the markdown file accordingly.

commit 944a10cff3c218bafcb8b43e7e3f309cc644633e
Author: Doris Xin <[email protected]>
Date:   2014-06-19T02:48:22Z

    [SPARK-2145] Add lower bound on sampling rate
    
    to guarantee sampling performance

commit 1fe1cff58d99f336406f67f89f53fe6ab3bfde5e
Author: Doris Xin <[email protected]>
Date:   2014-06-19T19:46:59Z

    Changed fractionByKey to a map to enable arg check

commit bd9dc6e08444ec83c5d6adc4c452b49d1ac2b154
Author: Doris Xin <[email protected]>
Date:   2014-06-19T20:00:47Z

    unit bug and style violation fixed

commit 4ad516b14559f06a32d65ef0ce2fa2d7526610bc
Author: Doris Xin <[email protected]>
Date:   2014-06-20T18:38:09Z

    remove unused imports from PairRDDFunctions

commit 254e03c96e1f2aaa5baa9c3d384adeb117e0b7ab
Author: Doris Xin <[email protected]>
Date:   2014-07-03T20:49:46Z

    minor fixes and Java API.
    
    punting on python for now. moved aggregateWithContext out of RDD

commit 6b5b10b71669590f862ec2b7d004ba976d59676c
Author: Doris Xin <[email protected]>
Date:   2014-07-03T20:58:29Z

    Merge branch 'master' into stratified

commit ee9d260e5eb6ee2d6912e57491daeded1704248c
Author: Doris Xin <[email protected]>
Date:   2014-07-06T23:20:39Z

    addressed reviewer comments

commit bbfb8c91a68deae076db7d9fdc518b6cee8a99f1
Author: Doris Xin <[email protected]>
Date:   2014-07-06T23:21:20Z

    Merge branch 'master' into stratified

commit 9884a9f03b18f3170d90d168c246d78fc463028e
Author: Doris Xin <[email protected]>
Date:   2014-07-08T18:58:39Z

    style fix

commit 680b677bc5276e1499c59c7e24abfae7d85e5c7d
Author: Doris Xin <[email protected]>
Date:   2014-07-08T23:49:25Z

    use mapPartitionsWithIndex instead
    
    also better readability and lots more comments.

commit a2bf756454b2ae6b71ddc67457c0f69f165b937d
Author: Doris Xin <[email protected]>
Date:   2014-07-08T23:50:17Z

    Merge branch 'master' into stratified

commit a10e68dd9c23b0fc1f25effd9ccd0ac3e7299206
Author: Doris Xin <[email protected]>
Date:   2014-07-09T01:09:14Z

    style fix

commit f4c21f324075bc5e7b8cba07a5c47d23fade542f
Author: Doris Xin <[email protected]>
Date:   2014-07-15T01:34:10Z

    Reviewer comments
    
    Added BernoulliBounds

commit eecee5fb31aa678db872eccbd7385461ea3a6a96
Author: Doris Xin <[email protected]>
Date:   2014-07-15T02:56:58Z

    Merge branch 'master' into stratified
    
    Conflicts:
        project/SparkBuild.scala

commit 1f9e2654921194baba29d58b862f21bd53c29470
Author: Xiangrui Meng <[email protected]>
Date:   2014-07-24T04:15:05Z

    merge master

commit bd6652d8e69c18dc8ee60ae00272fb99958bfae2
Author: Xiangrui Meng <[email protected]>
Date:   2014-07-24T04:15:54Z

    change commons-math3 scope to test

commit 3f6f0b784a0456f98619e40dd97a04352efd2192
Author: Xiangrui Meng <[email protected]>
Date:   2014-07-24T12:41:58Z

    refactor stratified sampling impl

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
