[ https://issues.apache.org/jira/browse/FLINK-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14618799#comment-14618799 ]
ASF GitHub Bot commented on FLINK-1723:
---------------------------------------
Github user thvasilo commented on a diff in the pull request:
https://github.com/apache/flink/pull/891#discussion_r34164157
--- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/evaluation/CrossValidation.scala ---
@@ -0,0 +1,97 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.flink.ml.evaluation
+
+import org.apache.flink.api.scala._
+import org.apache.flink.ml.RichDataSet
+import java.util.Random
+
+import org.apache.flink.ml.pipeline.{EvaluateDataSetOperation, FitOperation, Predictor}
+
+object CrossValidation {
+ def crossValScore[P <: Predictor[P], T](
+ predictor: P,
+ data: DataSet[T],
+ scorerOption: Option[Scorer] = None,
+ cv: FoldGenerator = KFold(),
+ seed: Long = new Random().nextLong())(implicit fitOperation: FitOperation[P, T],
+ evaluateDataSetOperation: EvaluateDataSetOperation[P, T, Double]): Array[DataSet[Double]] = {
+ val folds = cv.folds(data, 1)
+
+ val scores = folds.map {
+ case (training: DataSet[T], testing: DataSet[T]) =>
+ predictor.fit(training)
+ if (scorerOption.isEmpty) {
+ predictor.score(testing)
+ } else {
+ val s = scorerOption.get
+ s.evaluate(testing, predictor)
+ }
+ }
+ // TODO: Undecided on the return type: Array[DS[Double]] or DS[Double] i.e. reduce->union?
+ // Or: Return mean and std?
+ scores//.reduce((right: DataSet[Double], left: DataSet[Double]) => left.union(right)).mean()
--- End diff --
scikit-learn defines the `score` function separately in its `ClassifierMixin`
(mean accuracy) and `RegressorMixin` (R^2) mixins.
We could also try defining it in a `WithScore` trait, but I'm not sure how
that would work for chaining.
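
A rough sketch of what such a trait could look like, on plain Scala collections rather than `DataSet`. `WithScore`, `ClassifierScore` and `RegressorScore` are made-up names for illustration, not existing FlinkML API, and this does not address the chaining question:

```scala
// Hypothetical sketch: none of these names exist in FlinkML.
trait WithScore[T] {
  // Scores (truth, prediction) pairs; higher is better.
  def score(pairs: Seq[(T, T)]): Double
}

// Classifier default, mirroring sklearn's ClassifierMixin: mean accuracy.
trait ClassifierScore extends WithScore[Int] {
  def score(pairs: Seq[(Int, Int)]): Double =
    pairs.count { case (truth, pred) => truth == pred }.toDouble / pairs.size
}

// Regressor default, mirroring sklearn's RegressorMixin: R^2.
trait RegressorScore extends WithScore[Double] {
  def score(pairs: Seq[(Double, Double)]): Double = {
    val mean = pairs.map(_._1).sum / pairs.size
    // Residual sum of squares and total sum of squares.
    val ssRes = pairs.map { case (t, p) => (t - p) * (t - p) }.sum
    val ssTot = pairs.map { case (t, _) => (t - mean) * (t - mean) }.sum
    1.0 - ssRes / ssTot
  }
}
```

A predictor would then mix in whichever default fits, and an explicit `Scorer` could still override it.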
> Add cross validation for model evaluation
> -----------------------------------------
>
> Key: FLINK-1723
> URL: https://issues.apache.org/jira/browse/FLINK-1723
> Project: Flink
> Issue Type: New Feature
> Components: Machine Learning Library
> Reporter: Till Rohrmann
> Assignee: Theodore Vasiloudis
> Labels: ML
>
> Cross validation [1] is a standard tool to estimate the test error for a
> model. As such it is a crucial tool for every machine learning library.
> Cross validation should work with arbitrary Estimators and error metrics.
> The first strategy it should support is k-fold cross validation.
> Resources:
> [1] [http://en.wikipedia.org/wiki/Cross-validation]
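
For reference, a minimal sketch of the k-fold split described in the issue, on an in-memory `Seq` instead of a `DataSet`. `KFoldSketch.kFold` is a hypothetical helper and not the `KFold` fold generator from the pull request:

```scala
import scala.util.Random

// Hypothetical helper for illustration only; not part of FlinkML.
object KFoldSketch {
  // Returns k (training, testing) pairs; each element is tested exactly once.
  def kFold[T](data: Seq[T], k: Int, seed: Long): Seq[(Seq[T], Seq[T])] = {
    require(k >= 2 && k <= data.size, "need 2 <= k <= data.size")
    // Shuffle once with a fixed seed so the folds are reproducible.
    val indexed = new Random(seed).shuffle(data).zipWithIndex
    // Element i goes to fold i % k, so fold sizes differ by at most one.
    (0 until k).map { fold =>
      val (test, train) = indexed.partition { case (_, i) => i % k == fold }
      (train.map(_._1), test.map(_._1))
    }
  }
}
```

Fitting on each training part and scoring on the matching testing part yields k scores, which can then be averaged to estimate the test error.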
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)