GitHub user VinceShieh opened a pull request:
https://github.com/apache/spark/pull/14640
[SPARK-17055] add labelKFold to CrossValidator
## What changes were proposed in this pull request?
This patch improves the CrossValidator by adding a new training/validation
split method, labelKFold, which splits data based on data labels and makes sure
that the same label is not in both testing and training sets.
This is necessary, for example, when data is gathered from different
subjects: testing and training on different subjects keeps the model from
learning subject-specific features (e.g., cat-specific features), which
helps avoid over-fitting.
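
To illustrate the idea, here is a minimal sketch of a label-based k-fold split in plain Python. It is not the code in this patch (which targets Spark's Scala MLUtils); the function name `label_k_fold` and the greedy balancing strategy are assumptions made for illustration only.

```python
from collections import Counter


def label_k_fold(labels, k):
    """Yield (train_indices, test_indices) pairs for k folds.

    Hypothetical helper: all samples that share a label land in the
    same fold, so no label appears in both the train and test sets.
    """
    counts = Counter(labels)
    if k < 2 or k > len(counts):
        raise ValueError("k must be between 2 and the number of distinct labels")

    # Greedy balancing: place the largest label groups first, each into
    # the fold that currently holds the fewest samples.
    fold_sizes = [0] * k
    fold_of = {}
    for label in sorted(counts, key=lambda l: (-counts[l], l)):
        target = min(range(k), key=lambda f: fold_sizes[f])
        fold_of[label] = target
        fold_sizes[target] += counts[label]

    for fold in range(k):
        test = [i for i, l in enumerate(labels) if fold_of[l] == fold]
        train = [i for i, l in enumerate(labels) if fold_of[l] != fold]
        yield train, test


# Example: eight samples gathered from four subjects, split into two folds.
subjects = ["a", "a", "b", "b", "b", "c", "d", "d"]
for train, test in label_k_fold(subjects, 2):
    train_labels = {subjects[i] for i in train}
    test_labels = {subjects[i] for i in test}
    # The two label sets never overlap.
    assert train_labels.isdisjoint(test_labels)
```

A plain k-fold split would assign individual samples to folds, so samples from the same subject could end up on both sides; grouping by label first is what prevents that.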
## How was this patch tested?
A unit test was added to MLUtilsSuite.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/VinceShieh/spark labelKFold2
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/14640.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #14640
----
commit cbb78bce4022bfc46f570264de4087a01a84b281
Author: Vincent Xie <[email protected]>
Date: 2016-08-08T13:28:08Z
Add labelKFold to cross validation
Currently, only KFold is supported in cross validation. But in cases
when data is gathered from different subjects, we want to avoid
over-fitting. labelKFold is a variation of k-fold which ensures that
the same label is not in both testing and training sets.
A unit test, 'test labelKFold', is also added in MLUtilsSuite.
Signed-off-by: Vincent Xie <[email protected]>
Signed-off-by: VinceShieh <[email protected]>
commit 461d696aa6aa41818be31dc1628e3282e560854a
Author: VinceShieh <[email protected]>
Date: 2016-08-15T01:53:51Z
Merge remote-tracking branch 'origin/master' into labelKFold2
----