GitHub user rawkintrevo opened a pull request:
https://github.com/apache/flink/pull/1898
[FLINK-2259][ml] Add Train-Testing Splitters
This PR adds an object in ml/pipeline called splitter with the following
methods:
- randomSplit: Splits a DataSet into two DataSets using DataSet.sample
- multiRandomSplit: Splits a DataSet into multiple DataSets according to an
  array of proportions
- kFoldSplit: Splits a DataSet into k TrainTest objects, each with a testing
  DataSet holding 1/k of the original data and a training DataSet holding
  the remainder
- trainTestSplit: A wrapper for randomSplit that returns a TrainTest object
- trainTestHoldoutSplit: A wrapper for multiRandomSplit that returns a
  TrainTestHoldout object
The TrainTest and TrainTestHoldout objects are case classes. randomSplit
and multiRandomSplit return arrays of DataSets.
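To illustrate the splitting semantics described above, here is a minimal sketch on plain Scala collections. It is not the code in this PR: it uses Seq instead of Flink's DataSet so it runs without a cluster, the object name SplitterSketch and the seed parameters are hypothetical, and the TrainTest case class here is only a stand-in for the one the PR defines. Fold and partition sizes are approximate because elements are assigned at random.

```scala
import scala.util.Random

// Hypothetical stand-in for the PR's TrainTest case class (Seq instead of DataSet).
final case class TrainTest[T](training: Seq[T], testing: Seq[T])

object SplitterSketch {

  // Split data into fractions.length parts. Each element is assigned to part i
  // with probability fractions(i) / fractions.sum.
  def multiRandomSplit[T](data: Seq[T], fractions: Array[Double], seed: Long): Seq[Seq[T]] = {
    require(fractions.nonEmpty && fractions.forall(_ >= 0.0) && fractions.sum > 0.0,
      "fractions must be non-negative and sum to a positive value")
    val rng = new Random(seed)
    val total = fractions.sum
    // Cumulative upper bounds of each part, normalised to [0, 1].
    val bounds = fractions.scanLeft(0.0)(_ + _).tail.map(_ / total)
    val tagged = data.map { x =>
      val u = rng.nextDouble()
      val idx = bounds.indexWhere(u < _)
      // Guard against floating-point rounding in the last bound.
      (if (idx < 0) fractions.length - 1 else idx, x)
    }
    fractions.indices.map(i => tagged.filter(_._1 == i).map(_._2))
  }

  // Build k TrainTest pairs: each fold serves once as the testing set,
  // and the remaining k - 1 folds form the training set.
  def kFoldSplit[T](data: Seq[T], k: Int, seed: Long): Seq[TrainTest[T]] = {
    require(k >= 2, "k must be at least 2")
    val folds = multiRandomSplit(data, Array.fill(k)(1.0 / k), seed)
    (0 until k).map { i =>
      val training = folds.indices.filter(_ != i).flatMap(folds(_))
      TrainTest(training, folds(i))
    }
  }
}
```

Under these assumptions, SplitterSketch.kFoldSplit((1 to 100).toVector, 5, 42L) yields five TrainTest values in which every element appears in exactly one testing set.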
- [x] General
- [ ] Documentation
- Documentation is in code, will write markdown after
review/feedback/finalization
- [x] Tests & Build
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/rawkintrevo/flink train-test-split
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/1898.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #1898
----
commit ec1e65a31d80b33589b73619f2a5dd0a8e09c568
Author: Trevor Grant <[email protected]>
Date: 2016-04-15T22:37:51Z
Add Splitter Pre-processing
commit 3ecdc3818dd11a847136510dabe96f444924d319
Author: Trevor Grant <[email protected]>
Date: 2016-04-15T22:40:38Z
Add Splitter Pre-processing
----