[GitHub] spark pull request #14321: [SPARK-8971][ML] Add stratified sampling to ML Cr...

sethah Fri, 22 Jul 2016 14:42:50 -0700

GitHub user sethah opened a pull request:

    https://github.com/apache/spark/pull/14321


    [SPARK-8971][ML] Add stratified sampling to ML CrossValidator and 
TrainValidationSplit

    ## What changes were proposed in this pull request?
    
    This patch adds the ability to do stratified sampling in cross validation 
for ML pipelines. This is accomplished by modifying some of the methods in 
`StratifiedSamplingUtils` to support multiple _splits_ instead of a single 
subsample of the data. A method is added to `PairRDDFunctions` to support 
`randomSplitByKey`. Please see the detailed explanation below.
    
    
    ## How was this patch tested?
    
    Unit tests were added to `PairRDDFunctionsSuite`, `MLUtilsSuite`, 
`CrossValidatorSuite`, and `TrainValidationSuite`.
    
    ## Algorithm changes
    
    Currently, Spark implements a stratified sampling function on PairRDDs 
using the method `sampleByKeyExact` and `sampleByKey`. This method calls a 
stratified sampling routine that is implemented in `StratifiedSamplingUtils`. 
The underlying algorithm is described 
[here](http://jmlr.org/proceedings/papers/v28/meng13a.pdf) in the paper by 
Xiangrui Meng. When exact samples stratified samples are required, the 
algorithm makes an extra pass through the data. Each sample is mapped on to the 
interval [0, 1] (for sampling without replacement), and we expect that, say for 
a 50% sample, we will split the interval at 0.5 and accept the samples which 
fell below that threshold. Items near 0 are highly likely to be accepted, while 
items near 1 are highly unlikely to be accepted. Items near 0.5 are uncertain, 
and are added to a waitlist on the first pass. The items in the waitlist will 
be sorted and used to determine the exact split point which produces 50/50 
sample. 
    
    
![image](https://cloud.githubusercontent.com/assets/7275795/17071720/2f90bd14-5018-11e6-91c2-2c1dc2191213.png)
    
    This patch modifies the routine to produce multiple splits by generating 
multiple waitlists on the first pass. Each waitlist is sorted to determine the 
exact split points and then we can sample as normal. 
    
    
![image](https://cloud.githubusercontent.com/assets/7275795/17071744/4f3b6d76-5018-11e6-8591-857eaaf288f9.png)
    
    One potential concern is that if this is used for a large number of splits, 
it may degrade to the point where sorting the entire dataset would be quicker, 
as the waitlists get closer and closer together. It could potentially cause OOM 
errors on the driver if there are too many waitlists collected. Still, before 
this patch there was not a way to actually take a single _split_ of the data, 
as `sampleByKey` does not return the complement of the sample. This patch fixes 
this as well.
    
    ## ML API
    
    This patch also allows users to specify a stratified column in the 
`CrossValidator` and `TrainValidationSplit` estimators. This is done by 
converting the input dataframe to a PairRDD and calling the `randomSplitByKey` 
method. This is exposed via a `setStratifiedCol` parameter which, if set, will 
use _exact_ stratified splits for cross validation. 
    
    ## Future considerations
    
    This can be implemented as a function on dataframes in the future, if there 
is interest. It is somewhat inconvient to convert the dataframe to a pair rdd, 
perform sampling, and then convert back to a dataframe. 


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sethah/spark Working_on_SPARK-8971

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14321.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14321
    
----
commit a058cd8107666cb8bc5dd090fd1c52aadd896304
Author: sethah <[email protected]>
Date:   2015-08-08T00:04:13Z

    Adding stratified sampling to cross validation and train validation split 
in ml/tuning

commit 5f244d1cb5bd747e7383a85b54394a2fa9efa32e
Author: sethah <[email protected]>
Date:   2015-08-10T22:26:38Z

    Adding some tests and style fixes

commit 67f60027158fae37d3f3973fd22217298097ebd7
Author: sethah <[email protected]>
Date:   2016-04-22T14:51:13Z

    Refactor for efficiency when computing multiple waitlists.

commit 37be0b5c6a0a4bd6fcc4a0f59c5f575ef6f623ae
Author: sethah <[email protected]>
Date:   2016-07-22T17:16:31Z

    Move some logic back into SSUtils

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark pull request #14321: [SPARK-8971][ML] Add stratified sampling to ML Cr...

Reply via email to