GitHub user thvasilo opened a pull request:

    https://github.com/apache/flink/pull/891

    [FLINK-1723] [ml] [WIP] Add cross validation for model evaluation

    Cross validation (CV) [1] is a standard tool to estimate the test error for 
a model. As such it is a crucial tool for every machine learning library.
    
    This builds upon the ongoing work on the evaluation framework for FlinkML.
    As such, the current version supports calculating the score of Predictors 
only; the end goal is to support CV for Estimators as well, to cover the 
unsupervised learning case.
    
    We are using some code from the Apache Spark project, mostly simple 
routines for probabilistic sampling of datasets and generation of KFold CV data.
    
    More and better tests need to be added to the implementation, and the 
current sampling approaches probably will not work if used within an iteration.
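
    For illustration only, here is a minimal sketch of k-fold splitting and 
score averaging, written against plain Scala collections rather than Flink 
DataSets. It is not the code in this pull request: KFoldSketch, kFold, 
meanFoldScore and the fitAndScore callback are hypothetical names, and the 
sketch assumes a range-based sampling scheme in which every element gets one 
uniform draw and fold i validates on the draws that fall in [i/k, (i+1)/k).

```scala
import scala.util.Random

// Hypothetical helper, not part of FlinkML or of this pull request.
object KFoldSketch {

  /** Splits `data` into `k` (training, validation) pairs.
    *
    * Each element receives a single uniform draw in [0, 1). Fold i uses the
    * elements whose draw lies in [i/k, (i+1)/k) for validation and all other
    * elements for training, so every element is validated exactly once.
    */
  def kFold[T](data: Seq[T], k: Int, seed: Long): Seq[(Seq[T], Seq[T])] = {
    require(k >= 2, "k-fold cross validation needs at least two folds")
    val rng = new Random(seed)
    // Draw once per element and reuse the draws for every fold.
    val tagged = data.map(x => (x, rng.nextDouble()))
    (0 until k).map { i =>
      val lo = i.toDouble / k
      val hi = (i + 1).toDouble / k
      val (validation, training) =
        tagged.partition { case (_, u) => u >= lo && u < hi }
      (training.map(_._1), validation.map(_._1))
    }
  }

  /** Averages the per-fold scores.
    *
    * `fitAndScore` is an assumed callback: it trains on its first argument
    * and returns a score computed on its second argument.
    */
  def meanFoldScore[T](data: Seq[T], k: Int, seed: Long)(
      fitAndScore: (Seq[T], Seq[T]) => Double): Double = {
    val scores = kFold(data, k, seed).map { case (train, valid) =>
      fitAndScore(train, valid)
    }
    scores.sum / scores.size
  }
}
```

    Because membership is decided by an independent random draw, the folds 
are only approximately equal in size. A DataSet-based version would have to 
make the per-element draws reproducible across the k filter passes, which is 
one reason sampling inside an iteration needs care.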

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/thvasilo/flink cross-validation

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/891.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #891
    
----
commit 305b43a451af3d8bc859671476c215308fbfc7fc
Author: mikiobraun <[email protected]>
Date:   2015-06-22T15:04:42Z

    Adding some first loss functions for the evaluation framework

commit bdb1a6912d2bcec29446ca4a9fbc550f2ecb8f4a
Author: Theodore Vasiloudis <[email protected]>
Date:   2015-06-23T14:07:48Z

    Scorer for evaluation

commit 4a7593ade68f43d444a6b289191f053a4ea8b031
Author: Theodore Vasiloudis <[email protected]>
Date:   2015-06-25T09:41:10Z

    Adds accuracy score and R^2 score. Also trying out Scores as classes 
instead of functions.
    
    Not too happy with the extra boilerplate of Scores as classes; will 
probably revert and have objects like RegressionsScores and 
ClassificationScores that contain the definitions of the relevant scores.

commit 5c89c478bd00f168bfe48954d06367b28f948571
Author: Theodore Vasiloudis <[email protected]>
Date:   2015-06-26T11:30:56Z

    Adds an evaluate operation for LabeledVector input

commit e7bb4b42424641d640df370cd6ace71f7f42ee8d
Author: Theodore Vasiloudis <[email protected]>
Date:   2015-06-26T11:32:13Z

    Adds Regressor interface, and a score function for regression algorithms.

commit 3d8a6928b02b30c732f282df61613561dbf8d4fc
Author: Theodore Vasiloudis <[email protected]>
Date:   2015-06-30T14:04:58Z

    Added Classifier intermediate class, and default score function for 
classifiers.

commit e1a26ed30bb784633685703892f67d51136f6060
Author: Theodore Vasiloudis <[email protected]>
Date:   2015-07-01T08:20:41Z

    Going back to having scores defined in objects instead of their own classes.

commit 0dd251a5a59cd610c4df3e9a1ea3921b1a9cc2e0
Author: Theodore Vasiloudis <[email protected]>
Date:   2015-07-01T13:00:37Z

    Removed ParameterMap from predict function of PredictOperation

commit 492e9a383af6285f0fdca5031d2bd7bdfe3cd511
Author: Theodore Vasiloudis <[email protected]>
Date:   2015-07-02T10:21:28Z

    Reworked score functionality to allow chained Predictors.
    
    All predictors must now implement a calculateScore function.
    For now we assume that predictors are supervised learning algorithms;
    once unsupervised learning algorithms are added, this will need to be 
reworked.
    
    Also added an evaluate dataset operation to ALS, to allow for scoring of the
    algorithm. Default performance measure for ALS is RMSE.

commit d9715ed3a6faba78e0b34368425768e826d5a736
Author: Theodore Vasiloudis <[email protected]>
Date:   2015-07-06T08:50:59Z

    Made calculateScore only take DataSet[(Double, Double)]

commit 4983c47917c2776a856271dd5ae62b2b3735c466
Author: Theodore Vasiloudis <[email protected]>
Date:   2015-07-07T08:15:58Z

    Added test for DataSet.mean()

commit 250a754797869772041e8cb65e3a9498ae9244d0
Author: Theodore Vasiloudis <[email protected]>
Date:   2015-07-07T09:18:40Z

    Added simple sampling algorithms, using filter()

commit 2a3de8866d3beefbb4f188494024aba96d219f97
Author: Theodore Vasiloudis <[email protected]>
Date:   2015-07-07T10:10:33Z

    Added KFold splitting

commit 1febc843b38cc1b727a45c35da2eb8f1684592e6
Author: Theodore Vasiloudis <[email protected]>
Date:   2015-07-07T10:39:34Z

    Made KFold into a class, added folds class parameter

commit 85f8ed0dde61cace3cbe3757e6645a999b56ebc5
Author: Theodore Vasiloudis <[email protected]>
Date:   2015-07-07T12:11:45Z

    Switched from cross to mapWithBcVariable

commit 44d9251ecc965bf7d2bb40ffdf2653c99750af12
Author: Theodore Vasiloudis <[email protected]>
Date:   2015-07-08T09:11:22Z

    Added crossValScore function to compute the cross-validated score for a 
predictor.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
