[ https://issues.apache.org/jira/browse/FLINK-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14633165#comment-14633165 ]
ASF GitHub Bot commented on FLINK-1723:
---------------------------------------
Github user sachingoel0101 commented on a diff in the pull request:
https://github.com/apache/flink/pull/891#discussion_r34975724
--- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/evaluation/CrossValidation.scala ---
@@ -0,0 +1,97 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.flink.ml.evaluation
+
+import org.apache.flink.api.scala._
+import org.apache.flink.ml.RichDataSet
+import java.util.Random
+
+import org.apache.flink.ml.pipeline.{EvaluateDataSetOperation, FitOperation, Predictor}
+
+object CrossValidation {
+  def crossValScore[P <: Predictor[P], T](
+      predictor: P,
+      data: DataSet[T],
+      scorerOption: Option[Scorer] = None,
+      cv: FoldGenerator = KFold(),
+      seed: Long = new Random().nextLong())(implicit fitOperation: FitOperation[P, T],
+      evaluateDataSetOperation: EvaluateDataSetOperation[P, T, Double]): Array[DataSet[Double]] = {
+    val folds = cv.folds(data, 1)
+
+    val scores = folds.map {
+      case (training: DataSet[T], testing: DataSet[T]) =>
+        predictor.fit(training)
+        if (scorerOption.isEmpty) {
+          predictor.score(testing)
+        } else {
+          val s = scorerOption.get
+          s.evaluate(testing, predictor)
+        }
+    }
+    // TODO: Undecided on the return type: Array[DS[Double]] or DS[Double] i.e. reduce->union?
+    // Or: Return mean and std?
+    scores//.reduce((right: DataSet[Double], left: DataSet[Double]) => left.union(right)).mean()
+  }
+}
+
+abstract class FoldGenerator {
+
+  /** Takes a DataSet as input and creates splits (folds) of the data into
+    * (training, testing) pairs.
+    *
+    * @param input The DataSet that will be split into folds
+    * @param seed Seed for replicable splitting of the data
+    * @tparam T The type of the DataSet
+    * @return An Array containing K (training, testing) tuples, where training and testing are
+    *         DataSets
+    */
+  def folds[T](
+      input: DataSet[T],
+      seed: Long = new Random().nextLong()): Array[(DataSet[T], DataSet[T])]
+}
+
+class KFold(numFolds: Int) extends FoldGenerator {
+
+  /** Takes a DataSet as input and creates K splits (folds) of the data into non-overlapping
+    * (training, testing) pairs.
+    *
+    * Code based on Apache Spark implementation
+    * @param input The DataSet that will be split into folds
+    * @param seed Seed for replicable splitting of the data
+    * @tparam T The type of the DataSet
+    * @return An Array containing K (training, testing) tuples, where training and testing are
+    *         DataSets
+    */
+  override def folds[T](
+      input: DataSet[T],
+      seed: Long = new Random().nextLong()): Array[(DataSet[T], DataSet[T])] = {
+    val numFoldsF = numFolds.toFloat
+    (1 to numFolds).map { fold =>
+      val lb = (fold - 1) / numFoldsF
+      val ub = fold / numFoldsF
+      val validation = input.sampleBounded(lb, ub, complement = false, seed = seed)
+      val training = input.sampleBounded(lb, ub, complement = true, seed = seed)
+      (training, validation)
--- End diff ---
However, if the parallelism of the data is greater than one, this can lead to problems. The random number sequence generated on every node would be the same, wouldn't it?
I printed all the random numbers generated and it looks like this: https://gist.github.com/sachingoel0101/ecde269af996fba7a39a
Further, for a parallelism of 2, the test itself fails.
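One way around this would be to derive each parallel instance's seed from the user-supplied seed plus the subtask index, so that the subtasks draw different random sequences. A minimal sketch of that idea (hypothetical, not the PR's code; it assumes the bounded sampling simply keeps elements whose uniform draw falls inside [lb, ub), or outside it for the complement):

import java.util.Random

import org.apache.flink.api.common.functions.RichMapPartitionFunction
import org.apache.flink.util.Collector

// Hypothetical sampler: keeps elements whose random draw falls inside [lb, ub)
// (or outside it when complement = true). The RNG is seeded with the user seed
// plus the subtask index, so parallel instances do not replay the same sequence.
class BoundedSampler[T](lb: Double, ub: Double, complement: Boolean, seed: Long)
  extends RichMapPartitionFunction[T, T] {

  override def mapPartition(values: java.lang.Iterable[T], out: Collector[T]): Unit = {
    val rng = new Random(seed + getRuntimeContext.getIndexOfThisSubtask)
    val it = values.iterator()
    while (it.hasNext) {
      val element = it.next()
      val draw = rng.nextDouble()
      val inFold = draw >= lb && draw < ub
      if (inFold != complement) {
        out.collect(element)
      }
    }
  }
}

Even then, the split is only reproducible across runs if the partitioning of the input itself is deterministic.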
> Add cross validation for model evaluation
> -----------------------------------------
>
> Key: FLINK-1723
> URL: https://issues.apache.org/jira/browse/FLINK-1723
> Project: Flink
> Issue Type: New Feature
> Components: Machine Learning Library
> Reporter: Till Rohrmann
> Assignee: Theodore Vasiloudis
> Labels: ML
>
> Cross validation [1] is a standard tool to estimate the test error for a
> model. As such, it is a crucial tool for every machine learning library.
> The cross validation should work with arbitrary Estimators and error metrics.
> A first cross-validation strategy it should support is k-fold cross
> validation.
> Resources:
> [1] [http://en.wikipedia.org/wiki/Cross-validation]
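For reference, a usage sketch of the API proposed in the pull request above. It is only an illustration: MultipleLinearRegression is used as a stand-in Predictor, the toy data exists only so the sketch compiles, and it assumes the FitOperation and EvaluateDataSetOperation implicits required by crossValScore are in scope for the chosen predictor.

import org.apache.flink.api.scala._
import org.apache.flink.ml.common.LabeledVector
import org.apache.flink.ml.evaluation.{CrossValidation, KFold}
import org.apache.flink.ml.math.DenseVector
import org.apache.flink.ml.regression.MultipleLinearRegression

object CrossValScoreExample {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment

    // Toy data so the sketch compiles; real usage would read a proper training set.
    val data: DataSet[LabeledVector] = env.fromElements(
      LabeledVector(1.0, DenseVector(0.0, 1.0)),
      LabeledVector(2.0, DenseVector(1.0, 2.0)))

    val mlr = MultipleLinearRegression()

    // 3-fold cross validation; each element of the result is the score of one fold.
    val scores: Array[DataSet[Double]] =
      CrossValidation.crossValScore(mlr, data, cv = new KFold(3))

    scores.foreach(_.print())
  }
}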
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)