GitHub user sethah opened a pull request:
https://github.com/apache/spark/pull/14321
[SPARK-8971][ML] Add stratified sampling to ML CrossValidator and
TrainValidationSplit
## What changes were proposed in this pull request?
This patch adds the ability to do stratified sampling in cross validation
for ML pipelines. This is accomplished by modifying some of the methods in
`StratifiedSamplingUtils` to support multiple _splits_ instead of a single
subsample of the data. A method is added to `PairRDDFunctions` to support
`randomSplitByKey`. Please see the detailed explanation below.
## How was this patch tested?
Unit tests were added to `PairRDDFunctionsSuite`, `MLUtilsSuite`,
`CrossValidatorSuite`, and `TrainValidationSuite`.
## Algorithm changes
Currently, Spark implements stratified sampling on PairRDDs through the
methods `sampleByKeyExact` and `sampleByKey`. Both methods call a
stratified sampling routine that is implemented in `StratifiedSamplingUtils`.
The underlying algorithm is described
[here](http://jmlr.org/proceedings/papers/v28/meng13a.pdf) in the paper by
Xiangrui Meng. When exact stratified samples are required, the
algorithm makes an extra pass through the data. Each item is mapped onto the
interval [0, 1] (for sampling without replacement), and we expect that, say for
a 50% sample, we will split the interval at 0.5 and accept the items which
fall below that threshold. Items near 0 are highly likely to be accepted, while
items near 1 are highly unlikely to be accepted. Items near 0.5 are uncertain,
and are added to a waitlist on the first pass. The items in the waitlist are
then sorted and used to determine the exact split point which produces a 50/50
split.
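
As a rough illustration of the waitlist idea described above, here is a minimal
single-machine Python sketch for one stratum. It is not the code in
`StratifiedSamplingUtils`: the function name `exact_sample`, the fixed band
width `delta`, and the in-memory lists are all assumptions made for the
example. The real routine derives its accept and waitlist bounds from the
stratum size rather than using a fixed band.

```python
import math
import random


def exact_sample(items, fraction, delta=0.1, seed=0):
    """Split `items` into (sample, complement) with an exact sample size.

    Hypothetical sketch: `delta` is a fixed half-width for the uncertain
    band around `fraction`, chosen here only for illustration.
    """
    rng = random.Random(seed)
    target = math.ceil(fraction * len(items))
    lower, upper = fraction - delta, fraction + delta

    accepted, waitlist, rejected = [], [], []
    # First pass: draw a key in [0, 1) per item and classify it.
    for item in items:
        u = rng.random()
        if u < lower:
            accepted.append(item)        # far below the threshold: accept
        elif u < upper:
            waitlist.append((u, item))   # near the threshold: decide later
        else:
            rejected.append(item)        # far above the threshold: reject

    # Sort only the waitlist and take just enough items to hit the target.
    waitlist.sort(key=lambda pair: pair[0])
    need = target - len(accepted)
    if not 0 <= need <= len(waitlist):
        raise ValueError("uncertain band too narrow; increase delta")

    sample = accepted + [item for _, item in waitlist[:need]]
    complement = [item for _, item in waitlist[need:]] + rejected
    return sample, complement
```

Only the waitlist is sorted, which is what keeps the exact variant cheaper
than sorting every item in the stratum.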

This patch modifies the routine to produce multiple splits by generating
multiple waitlists on the first pass. Each waitlist is sorted to determine the
exact split points and then we can sample as normal.
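
The multi-split variant can be sketched the same way. The following Python
example is an assumption-laden illustration for a single stratum, not the
patch's implementation: `exact_splits`, the fixed band width `delta`, and the
in-memory list of keys are hypothetical. It builds one waitlist around each
interior split boundary, sorts each waitlist to find an exact threshold, and
then assigns items to splits by comparing their keys with those thresholds.

```python
import random
from itertools import accumulate


def exact_splits(items, weights, delta=0.1, seed=0):
    """Partition `items` into len(weights) splits with exact sizes.

    Hypothetical sketch: `delta` is a fixed half-width for the uncertain
    band around each boundary, chosen here only for illustration.
    """
    rng = random.Random(seed)
    n = len(items)
    # Keys are kept in memory here; a distributed version would regenerate
    # them from the seed on the second pass instead.
    keyed = [(rng.random(), item) for item in items]

    total = sum(weights)
    # Interior boundaries on [0, 1], e.g. weights (3, 1, 1) -> 0.6, 0.8.
    bounds = [c / total for c in accumulate(weights)][:-1]

    thresholds = []
    for b in bounds:
        target = round(b * n)  # items that must fall below this boundary
        below = sum(1 for u, _ in keyed if u < b - delta)
        # One waitlist per boundary, holding only the uncertain keys.
        waitlist = sorted(u for u, _ in keyed if b - delta <= u < b + delta)
        need = target - below
        if not 0 <= need <= len(waitlist):
            raise ValueError("uncertain band too narrow; increase delta")
        # Exactly `target` keys are strictly below the chosen threshold.
        thresholds.append(waitlist[need] if need < len(waitlist) else b + delta)

    # Second pass: the split index is the number of thresholds at or below
    # the item's key.
    splits = [[] for _ in weights]
    for u, item in keyed:
        splits[sum(1 for t in thresholds if u >= t)].append(item)
    return splits
```

For example, `exact_splits(list(range(1000)), [3, 1, 1])` yields three
disjoint splits of 600, 200, and 200 items.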

One potential concern is that if this is used for a large number of splits,
it may degrade to the point where sorting the entire dataset would be quicker,
as the waitlists get closer and closer together. It could potentially cause OOM
errors on the driver if there are too many waitlists collected. Still, before
this patch there was not a way to actually take a single _split_ of the data,
as `sampleByKey` does not return the complement of the sample. This patch fixes
this as well.
## ML API
This patch also allows users to specify a stratified column in the
`CrossValidator` and `TrainValidationSplit` estimators. This is done by
converting the input dataframe to a PairRDD and calling the `randomSplitByKey`
method. This is exposed via a `setStratifiedCol` setter; when the column is
set, _exact_ stratified splits are used for cross validation.
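
To make concrete what exact stratified splits mean for cross validation, here
is a small plain-Python sketch. It does not use Spark or the
`randomSplitByKey` method from this patch; it swaps in a simpler
shuffle-and-deal technique, and the names `stratified_k_fold` and `stratum_of`
are hypothetical. Each stratum is shuffled and dealt round-robin into `k`
folds, so every fold holds the same number of rows from each stratum, give or
take one.

```python
import random
from collections import defaultdict


def stratified_k_fold(rows, stratum_of, k, seed=0):
    """Yield (training, validation) pairs with per-stratum balanced folds."""
    rng = random.Random(seed)

    # Group rows by the value of the stratification column.
    by_stratum = defaultdict(list)
    for row in rows:
        by_stratum[stratum_of(row)].append(row)

    # Shuffle each stratum and deal its rows round-robin into the folds.
    folds = [[] for _ in range(k)]
    for group in by_stratum.values():
        rng.shuffle(group)
        for i, row in enumerate(group):
            folds[i % k].append(row)

    # Each fold serves once as the validation set.
    for i in range(k):
        validation = folds[i]
        training = [row for j, fold in enumerate(folds) if j != i for row in fold]
        yield training, validation
```

With 90 rows labelled "a" and 10 labelled "b" and `k = 3`, every validation
fold contains 30 "a" rows and either 3 or 4 "b" rows, whereas a purely random
split gives no such guarantee for the rare label.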
## Future considerations
This can be implemented as a function on DataFrames in the future, if there
is interest. It is somewhat inconvenient to convert the DataFrame to a PairRDD,
perform sampling, and then convert back to a DataFrame.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/sethah/spark Working_on_SPARK-8971
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/14321.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #14321
----
commit a058cd8107666cb8bc5dd090fd1c52aadd896304
Author: sethah <[email protected]>
Date: 2015-08-08T00:04:13Z
Adding stratified sampling to cross validation and train validation split
in ml/tuning
commit 5f244d1cb5bd747e7383a85b54394a2fa9efa32e
Author: sethah <[email protected]>
Date: 2015-08-10T22:26:38Z
Adding some tests and style fixes
commit 67f60027158fae37d3f3973fd22217298097ebd7
Author: sethah <[email protected]>
Date: 2016-04-22T14:51:13Z
Refactor for efficiency when computing multiple waitlists.
commit 37be0b5c6a0a4bd6fcc4a0f59c5f575ef6f623ae
Author: sethah <[email protected]>
Date: 2016-07-22T17:16:31Z
Move some logic back into SSUtils
----