GitHub user sethah opened a pull request:
https://github.com/apache/spark/pull/8112
[SPARK-8971][MLLIB][ML] Support balanced class labels when splitting
train/cross validation sets
I'm leaving a few comments about some of the design choices made in this PR.
- Both train/validation split and k-fold cross validation require that the
full dataset be partitioned into random samples. For this reason, there has to
be some way to group random keys into ranges. `sampleByKeyExact` does not
currently allow this, and it also has no way to return the complement of a
subsample.
- The RDD package has a `randomSplit` method that returns approximate
splits of an RDD. I chose to implement a `randomSplitByKey` method in
_PairRDDFunctions_. It uses a new sampling function in
_StratifiedSamplingUtils_ that samples by filtering random keys into a
range and can also return the complement of that range.
- The `randomSplitByKey` function uses the
[ScaSRS](http://jmlr.org/proceedings/papers/v28/meng13a.html) method to
compute the exact thresholds for each stratum. It does this k-1 times (for k
splits) so that each stratum is partitioned exactly. These exact thresholds are
passed into the sampling function, which then samples the data into partitions.
For simplicity, this method returns each split AND each split's complement.
- The currently proposed solution avoids changing any existing RDD-level
functions or methods.
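
To make the idea in the bullets above concrete, here is a minimal sketch of splitting by key with range filtering and complements. It is not the code in this PR: it runs on plain Scala collections rather than RDDs, it finds the per-stratum thresholds by sorting the random keys instead of using ScaSRS, and the names `StratifiedSplitSketch` and `randomSplitByKeySketch` are made up for illustration.

```scala
import scala.util.Random

object StratifiedSplitSketch {
  /** Hypothetical helper: splits `data` so that every key (stratum) is divided
    * in proportion to `weights`. Returns one (split, complement) pair per weight. */
  def randomSplitByKeySketch[K, V](
      data: Seq[(K, V)],
      weights: Array[Double],
      seed: Long): Array[(Seq[(K, V)], Seq[(K, V)])] = {
    require(weights.nonEmpty && weights.forall(_ >= 0.0) && weights.sum > 0.0)
    val rng = new Random(seed)
    // Attach one uniform random key in [0, 1) to every record.
    val keyed: Seq[((K, V), Double)] = data.map(kv => (kv, rng.nextDouble()))

    // Cumulative fractions, e.g. weights (3, 1) -> (0.0, 0.75, 1.0).
    val total = weights.sum
    val cumulative = weights.scanLeft(0.0)(_ + _).map(_ / total)

    // Per stratum, bounds(i) is the random key found at cumulative fraction i.
    // Assumption: a full sort stands in for the ScaSRS threshold computation.
    val boundsByKey: Map[K, Array[Double]] =
      keyed.groupBy(_._1._1).map { case (k, recs) =>
        val sorted = recs.map(_._2).sorted.toArray
        val n = sorted.length
        val bounds = cumulative.map { c =>
          val idx = math.round(c * n).toInt
          if (idx >= n) Double.PositiveInfinity else sorted(idx)
        }
        k -> bounds
      }

    // Split i keeps records whose random key falls in [bounds(i), bounds(i + 1));
    // everything else in the dataset is that split's complement.
    Array.tabulate(weights.length) { i =>
      val (in, out) = keyed.partition { case ((k, _), r) =>
        val b = boundsByKey(k)
        r >= b(i) && r < b(i + 1)
      }
      (in.map(_._1), out.map(_._1))
    }
  }
}
```

With weights `Array(0.75, 0.25)`, the first pair can serve as (training, validation) and each label keeps its 75/25 proportion in both parts, provided no two records draw the same random key. For k folds, passing k equal weights gives each fold as a split and its complement as the matching training set.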
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/sethah/spark Working_on_SPARK-8971
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/8112.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #8112
----
commit 24c0182a3a87d95042b0d6259b2200a0e6b89950
Author: sethah <[email protected]>
Date: 2015-08-08T00:04:13Z
Adding stratified sampling to cross validation and train validation split
in ml/tuning
commit cb232670a44b9dd9839aa6a829f97038cb15131b
Author: sethah <[email protected]>
Date: 2015-08-10T22:26:38Z
Adding tests for stratified splits
commit 16644120642ae8e828a6c169c8d92c3a96a1fdc7
Author: sethah <[email protected]>
Date: 2015-08-11T21:09:48Z
some scalastyle corrections
----