zhengruifeng commented on a change in pull request #28704:
URL: https://github.com/apache/spark/pull/28704#discussion_r437107792



##########
File path: mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala
##########
@@ -248,6 +248,19 @@ object MLUtils extends Logging {
     }.toArray
   }
 
+  /**
+   * Version of `kFold()` taking a fold column name.
+   */
+  @Since("3.1.0")
+  def kFold(df: DataFrame, numFolds: Int, foldColName: String): Array[(RDD[Row], RDD[Row])] = {
+    val foldCol = df.col(foldColName)
+    val dfWithMod = df.withColumn(foldColName, pmod(foldCol, lit(numFolds)))
+    (0 until numFolds).map { fold =>
+      (dfWithMod.filter(col(foldColName) =!= fold).drop(foldColName).rdd,
+        dfWithMod.filter(col(foldColName) === fold).drop(foldColName).rdd)

Review comment:
       I would also prefer adding a `numFolds` check here, though not strongly.
   Since ML impls tend to transform the input dataframe to `RDD[Vector]` and then 
cache it, this check may be cheap compared with the training.
   

##########
File path: mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala
##########
@@ -248,6 +248,19 @@ object MLUtils extends Logging {
     }.toArray
   }
 
+  /**
+   * Version of `kFold()` taking a fold column name.
+   */
+  @Since("3.1.0")
+  def kFold(df: DataFrame, numFolds: Int, foldColName: String): Array[(RDD[Row], RDD[Row])] = {
+    val foldCol = df.col(foldColName)
+    val dfWithMod = df.withColumn(foldColName, pmod(foldCol, lit(numFolds)))
+    (0 until numFolds).map { fold =>
+      (dfWithMod.filter(col(foldColName) =!= fold).drop(foldColName).rdd,
+        dfWithMod.filter(col(foldColName) === fold).drop(foldColName).rdd)

Review comment:
       Both: 1, check that the `foldCol` values are in [0, numFolds); 2, check 
that for each fold, neither the train rdd nor the validation rdd is empty.
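
   A minimal sketch of the two checks listed above, written against plain Scala collections so it runs without Spark. The helper name `foldProblems` and the `foldCounts` input are hypothetical and not part of the pull request; with a DataFrame, the per-value row counts could be collected from something like `df.groupBy(foldColName).count()`. It also assumes the fold column holds non-null integers.

```scala
// Hypothetical helper, not part of the pull request.
// foldCounts maps each distinct fold value to the number of rows carrying it.
def foldProblems(foldCounts: Map[Int, Long], numFolds: Int): Seq[String] = {
  require(numFolds >= 2, s"numFolds must be at least 2, but got $numFolds")

  // Check 1: every fold value must lie in [0, numFolds).
  val outOfRange = foldCounts.keys.filter(f => f < 0 || f >= numFolds).toSeq.sorted

  // Check 2: every fold in [0, numFolds) needs at least one row, so its
  // validation set is non-empty. With numFolds >= 2 and no empty fold, each
  // training set (all other folds) is non-empty as well.
  val emptyFolds = (0 until numFolds).filterNot(f => foldCounts.getOrElse(f, 0L) > 0L)

  val rangeMsg =
    if (outOfRange.nonEmpty) {
      Seq(s"fold values outside [0, $numFolds): ${outOfRange.mkString(", ")}")
    } else {
      Seq.empty[String]
    }
  val emptyMsg =
    if (emptyFolds.nonEmpty) Seq(s"no rows for fold(s): ${emptyFolds.mkString(", ")}")
    else Seq.empty[String]

  rangeMsg ++ emptyMsg
}
```

   Returning the list of problems instead of throwing leaves it to the caller whether to fail fast or only log a warning.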
   
   I am afraid that if `numFolds` is set wrongly, for example numFolds=3 while 
the `foldCol` values are in {0, 1, 2, 3}, then there may be task skew.
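
   One reading of the skew concern, sketched in plain Scala: because the diff above applies `pmod(foldCol, lit(numFolds))`, a fold value of 3 is folded into fold 0 when numFolds=3. The local `pmod` below only mimics Spark SQL's `pmod` for a positive divisor, and the figure of 100 rows per fold value is an assumption chosen for illustration.

```scala
// Local stand-in for Spark SQL's pmod with a positive divisor:
// the result always lies in [0, n).
def pmod(a: Int, n: Int): Int = ((a % n) + n) % n

val numFolds = 3
val foldValues = Seq(0, 1, 2, 3) // values actually present in the fold column
val rowsPerValue = 100L          // assumed: each value appears on 100 rows

// Row count per fold after the modulo is applied.
val rowsPerFold: Map[Int, Long] = foldValues
  .groupBy(v => pmod(v, numFolds))
  .map { case (fold, values) => fold -> values.size * rowsPerValue }

// Fold 0 receives the rows for values 0 and 3, so it is twice as large:
// rowsPerFold == Map(0 -> 200, 1 -> 100, 2 -> 100)
```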
   






