[
https://issues.apache.org/jira/browse/FLINK-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14618274#comment-14618274
]
ASF GitHub Bot commented on FLINK-1723:
---------------------------------------
GitHub user thvasilo opened a pull request:
https://github.com/apache/flink/pull/891
[FLINK-1723] [ml] [WIP] Add cross validation for model evaluation
Cross validation (CV) [1] is a standard tool to estimate the test error for
a model. As such it is a crucial tool for every machine learning library.
This builds upon the ongoing work on the evaluation framework for FlinkML.
As such, the current version supports calculating the score of Predictors
only; the end goal is to support CV for Estimators as well, to
cover the unsupervised learning case.
We are using some code from the Apache Spark project, mostly simple
routines for probabilistic sampling of datasets and generation of KFold CV data.
More and better tests need to be added to the implementation, and the
current sampling approaches probably will not work if used within an iteration.
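As an illustration of the sampling-based KFold generation described above, here is a minimal sketch in plain Python rather than the FlinkML Scala code. The names k_fold and cross_val_score, their parameters, and the use of in-memory lists instead of DataSets are assumptions made for this example only; they are not the API of this pull request. Each element receives one seeded uniform draw, and fold i validates on the elements whose draw falls in [i/folds, (i+1)/folds), so fold sizes are only approximately equal.

```python
import random
from statistics import mean


def k_fold(data, folds, seed=0):
    """Yield (training, validation) pairs, one per fold.

    Hypothetical helper: every element gets one uniform draw in [0, 1).
    Fold i validates on the elements whose draw lies in
    [i/folds, (i+1)/folds) and trains on all remaining elements.
    """
    rng = random.Random(seed)
    draws = [rng.random() for _ in data]
    for i in range(folds):
        lower, upper = i / folds, (i + 1) / folds
        validation = [x for x, u in zip(data, draws) if lower <= u < upper]
        training = [x for x, u in zip(data, draws) if not (lower <= u < upper)]
        yield training, validation


def cross_val_score(fit, score, data, folds=5, seed=0):
    """Average the score of a freshly fitted model over all folds.

    `fit(training)` returns a model and `score(model, validation)` returns
    a number; both are supplied by the caller (assumed interface).
    """
    scores = []
    for training, validation in k_fold(data, folds, seed):
        model = fit(training)
        scores.append(score(model, validation))
    return mean(scores)
```

Because membership is decided by a random draw per element, a fold can be empty on very small inputs; a caller would need to handle that case in its score function.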
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/thvasilo/flink cross-validation
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/891.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #891
----
commit 305b43a451af3d8bc859671476c215308fbfc7fc
Author: mikiobraun <[email protected]>
Date: 2015-06-22T15:04:42Z
Adding some first loss functions for the evaluation framework
commit bdb1a6912d2bcec29446ca4a9fbc550f2ecb8f4a
Author: Theodore Vasiloudis <[email protected]>
Date: 2015-06-23T14:07:48Z
Scorer for evaluation
commit 4a7593ade68f43d444a6b289191f053a4ea8b031
Author: Theodore Vasiloudis <[email protected]>
Date: 2015-06-25T09:41:10Z
Adds accuracy score and R^2 score. Also trying out Scores as classes
instead of functions.
Not too happy with the extra boilerplate of Scores as classes; will probably
revert and have objects like RegressionsScores, ClassificationScores that
contain the definitions of the relevant scores.
commit 5c89c478bd00f168bfe48954d06367b28f948571
Author: Theodore Vasiloudis <[email protected]>
Date: 2015-06-26T11:30:56Z
Adds an evaluate operation for LabeledVector input
commit e7bb4b42424641d640df370cd6ace71f7f42ee8d
Author: Theodore Vasiloudis <[email protected]>
Date: 2015-06-26T11:32:13Z
Adds Regressor interface, and a score function for regression algorithms.
commit 3d8a6928b02b30c732f282df61613561dbf8d4fc
Author: Theodore Vasiloudis <[email protected]>
Date: 2015-06-30T14:04:58Z
Added Classifier intermediate class, and default score function for
classifiers.
commit e1a26ed30bb784633685703892f67d51136f6060
Author: Theodore Vasiloudis <[email protected]>
Date: 2015-07-01T08:20:41Z
Going back to having scores defined in objects instead of their own classes.
commit 0dd251a5a59cd610c4df3e9a1ea3921b1a9cc2e0
Author: Theodore Vasiloudis <[email protected]>
Date: 2015-07-01T13:00:37Z
Removed ParameterMap from predict function of PredictOperation
commit 492e9a383af6285f0fdca5031d2bd7bdfe3cd511
Author: Theodore Vasiloudis <[email protected]>
Date: 2015-07-02T10:21:28Z
Reworked score functionality to allow chained Predictors.
All predictors must now implement a calculateScore function.
For now we assume that predictors are supervised learning algorithms;
once unsupervised learning algorithms are added, this will need to be
reworked.
Also added an evaluate dataset operation to ALS, to allow for scoring of the
algorithm. Default performance measure for ALS is RMSE.
commit d9715ed3a6faba78e0b34368425768e826d5a736
Author: Theodore Vasiloudis <[email protected]>
Date: 2015-07-06T08:50:59Z
Made calculateScore only take DataSet[(Double, Double)]
commit 4983c47917c2776a856271dd5ae62b2b3735c466
Author: Theodore Vasiloudis <[email protected]>
Date: 2015-07-07T08:15:58Z
Added test for DataSet.mean()
commit 250a754797869772041e8cb65e3a9498ae9244d0
Author: Theodore Vasiloudis <[email protected]>
Date: 2015-07-07T09:18:40Z
Added simple sampling algorithms, using filter()
commit 2a3de8866d3beefbb4f188494024aba96d219f97
Author: Theodore Vasiloudis <[email protected]>
Date: 2015-07-07T10:10:33Z
Added KFold splitting
commit 1febc843b38cc1b727a45c35da2eb8f1684592e6
Author: Theodore Vasiloudis <[email protected]>
Date: 2015-07-07T10:39:34Z
Made KFold into a class, added folds class parameter
commit 85f8ed0dde61cace3cbe3757e6645a999b56ebc5
Author: Theodore Vasiloudis <[email protected]>
Date: 2015-07-07T12:11:45Z
Switched from cross to mapWithBcVariable
commit 44d9251ecc965bf7d2bb40ffdf2653c99750af12
Author: Theodore Vasiloudis <[email protected]>
Date: 2015-07-08T09:11:22Z
Added crossValScore function to compute the cross-validated score for a
predictor.
----
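The commits above mention accuracy, R^2, and RMSE as scores computed from pairs of doubles. A minimal Python sketch of those three measures follows. The function names and the (truth, prediction) ordering of each pair are assumptions for illustration, not the definitions used in the pull request.

```python
from math import sqrt


def accuracy(pairs):
    """Fraction of (truth, prediction) pairs whose values match exactly."""
    pairs = list(pairs)
    return sum(1 for truth, pred in pairs if truth == pred) / len(pairs)


def r2_score(pairs):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Undefined (division by zero) when all truth values are identical.
    """
    pairs = list(pairs)
    mean_truth = sum(truth for truth, _ in pairs) / len(pairs)
    ss_res = sum((truth - pred) ** 2 for truth, pred in pairs)
    ss_tot = sum((truth - mean_truth) ** 2 for truth, _ in pairs)
    return 1.0 - ss_res / ss_tot


def rmse(pairs):
    """Root mean squared error over (truth, prediction) pairs."""
    pairs = list(pairs)
    return sqrt(sum((truth - pred) ** 2 for truth, pred in pairs) / len(pairs))
```

Taking a single collection of pairs keeps each score independent of how the predictions were produced, which is what lets the same function score a chained predictor.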
> Add cross validation for model evaluation
> -----------------------------------------
>
> Key: FLINK-1723
> URL: https://issues.apache.org/jira/browse/FLINK-1723
> Project: Flink
> Issue Type: New Feature
> Components: Machine Learning Library
> Reporter: Till Rohrmann
> Assignee: Theodore Vasiloudis
> Labels: ML
>
> Cross validation [1] is a standard tool to estimate the test error for a
> model. As such it is a crucial tool for every machine learning library.
> The cross validation should work with arbitrary Estimators and error metrics.
> The first cross validation strategy it should support is k-fold cross
> validation.
> Resources:
> [1] [http://en.wikipedia.org/wiki/Cross-validation]
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)