[ 
https://issues.apache.org/jira/browse/FLINK-2259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15243774#comment-15243774
 ] 

ASF GitHub Bot commented on FLINK-2259:
---------------------------------------

GitHub user rawkintrevo opened a pull request:

    https://github.com/apache/flink/pull/1898

    [FLINK-2259][ml] Add Train-Testing Splitters

    This PR adds an object in ml/pipeline called splitter with the following
    methods:

    randomSplit: Splits a DataSet into two DataSets using DataSet.sample
    multiRandomSplit: Splits a DataSet into multiple DataSets according to an
    array of proportions
    kFoldSplit: Splits a DataSet into k TrainTest objects, each with a testing
    DataSet that is 1/k of the original DataSet and a training DataSet that
    holds the remainder
    trainTestSplit: A wrapper for randomSplit that returns a TrainTest object
    trainTestHoldoutSplit: A wrapper for multiRandomSplit that returns a
    TrainTestHoldout object

    The TrainTest and TrainTestHoldout objects are case classes. randomSplit
    and multiRandomSplit return arrays of DataSets.
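
The splitting semantics described above can be illustrated with a small, library-free sketch. This is plain Python over lists, not the Scala code in this PR: the function names mirror the methods listed above, but the signatures, the `seed` parameter, and the per-element sampling strategy are assumptions made for illustration only.

```python
import bisect
import itertools
import random
from collections import namedtuple

# Hypothetical stand-in for the TrainTest case class mentioned above.
TrainTest = namedtuple("TrainTest", ["training", "testing"])


def random_split(data, fraction, seed=None):
    """Send each element to the first part with probability `fraction`."""
    rng = random.Random(seed)
    first, second = [], []
    for item in data:
        (first if rng.random() < fraction else second).append(item)
    return first, second


def multi_random_split(data, proportions, seed=None):
    """Split into len(proportions) parts; proportions are normalized to sum to 1."""
    total = sum(proportions)
    bounds = list(itertools.accumulate(p / total for p in proportions))
    rng = random.Random(seed)
    parts = [[] for _ in proportions]
    for item in data:
        # Index of the first cumulative bound above the draw; clamp against
        # floating-point rounding in the last bound.
        idx = min(bisect.bisect_right(bounds, rng.random()), len(parts) - 1)
        parts[idx].append(item)
    return parts


def k_fold_split(data, k, seed=None):
    """Return k TrainTest pairs; each element is in exactly one testing fold."""
    items = list(data)
    random.Random(seed).shuffle(items)
    folds = [items[i::k] for i in range(k)]
    result = []
    for i in range(k):
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        result.append(TrainTest(training, folds[i]))
    return result
```

Note that the two random splits here produce parts whose sizes only approximate the requested proportions, while this `k_fold_split` shuffles once and strides, so fold sizes differ by at most one element. Whether the PR's implementation behaves the same way is not stated in the description.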
    
    - [x] General
      
    - [ ] Documentation
      - Documentation is in code; will write markdown after
        review/feedback/finalization
    
    - [x] Tests & Build


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rawkintrevo/flink train-test-split

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/1898.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1898
    
----
commit ec1e65a31d80b33589b73619f2a5dd0a8e09c568
Author: Trevor Grant <[email protected]>
Date:   2016-04-15T22:37:51Z

    Add Splitter Pre-processing

commit 3ecdc3818dd11a847136510dabe96f444924d319
Author: Trevor Grant <[email protected]>
Date:   2016-04-15T22:40:38Z

    Add Splitter Pre-processing

----


> Support training Estimators using a (train, validation, test) split of the 
> available data
> -----------------------------------------------------------------------------------------
>
>                 Key: FLINK-2259
>                 URL: https://issues.apache.org/jira/browse/FLINK-2259
>             Project: Flink
>          Issue Type: New Feature
>          Components: Machine Learning Library
>            Reporter: Theodore Vasiloudis
>            Assignee: Trevor Grant
>            Priority: Minor
>              Labels: ML
>
> When there is an abundance of data available, a good way to train models is 
> to split the available data into three parts: Train, Validation, and Test.
> We use the Train data to train the model, the Validation data to estimate 
> the test error and select hyperparameters, and the Test data to evaluate 
> the performance of the model and assess its generalization [1].
> This is a common approach when training Artificial Neural Networks, and a 
> good strategy to choose in data-rich environments. Therefore we should have 
> some support for this data-analysis process in our Estimators.
> [1] Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. The Elements of 
> Statistical Learning. Vol. 1. Springer Series in Statistics. Berlin: 
> Springer, 2001.
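
The train/validation/test workflow described in the issue can be sketched end to end as follows. This is a toy illustration in plain Python, not FlinkML code: the synthetic data, the 60/20/20 proportions, the one-dimensional ridge model, and the candidate regularization values are all assumptions chosen to keep the example self-contained.

```python
import random


def fit_ridge_1d(points, lam):
    """Closed-form 1-D ridge regression through the origin."""
    sxy = sum(x * y for x, y in points)
    sxx = sum(x * x for x, _ in points)
    return sxy / (sxx + lam)


def mse(points, w):
    """Mean squared error of the model y = w * x on the given points."""
    return sum((y - w * x) ** 2 for x, y in points) / len(points)


rng = random.Random(0)

# Made-up data: y = 2x plus Gaussian noise.
data = []
for _ in range(300):
    x = rng.uniform(-1.0, 1.0)
    data.append((x, 2.0 * x + rng.gauss(0.0, 0.5)))

# 60/20/20 split into Train, Validation, and Test.
rng.shuffle(data)
train, validation, test = data[:180], data[180:240], data[240:]

# Fit on Train for every candidate and select the hyperparameter on Validation.
candidates = [0.0, 0.1, 1.0, 10.0, 100.0]
best_lam = min(candidates, key=lambda lam: mse(validation, fit_ridge_1d(train, lam)))

# Test is touched only once, to estimate how the selected model generalizes.
best_w = fit_ridge_1d(train, best_lam)
test_error = mse(test, best_w)
```

The point of the third split is that `test_error` is computed on data that influenced neither the fitted weight nor the choice of `best_lam`, so it is not biased by the selection step.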



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
