firestarman commented on a change in pull request #25983: 
[SPARK-29327][MLLIB]Support specifying features via multiple columns
URL: https://github.com/apache/spark/pull/25983#discussion_r332817984
 
 

 ##########
 File path: mllib/src/test/scala/org/apache/spark/ml/PredictorSuite.scala
 ##########
 @@ -55,14 +55,50 @@ class PredictorSuite extends SparkFunSuite with 
MLlibTestSparkContext {
       predictor.fit(df.select(col("label"), col("weight").cast(StringType), 
col("features")))
     }
   }
+
+  test("multiple columns for features should work well without side effect") {
+    // Should fail due to not supporting multiple columns
+    intercept[IllegalArgumentException] {
+      new MockPredictor(false).setFeaturesCol(Array("feature1", "feature2", 
"feature3"))
+    }
+
+    // Only use multiple columns for features
+    val df = spark.createDataFrame(Seq(
+      (0, 1, 0, 2, 3),
+      (1, 2, 0, 3, 9),
+      (0, 3, 0, 2, 6)
+    )).toDF("label", "weight", "feature1", "feature2", "feature3")
+
+    val predictor = new MockPredictor().setWeightCol("weight")
+      .setFeaturesCol(Array("feature1", "feature2", "feature3"))
+    predictor.fit(df)
+
+    // Should fail due to wrong type for column "feature1" in schema
+    intercept[IllegalArgumentException] {
+      predictor.fit(df.select(col("label"), col("weight"),
+        col("feature1").cast(StringType), col("feature2"), col("feature3")))
+    }
+
+    val df2 = df.toDF("label", "weight", "features", "feature2", "feature3")
+    // Should fail due to missing "feature1" in schema
+    intercept[IllegalArgumentException] {
+      predictor.setFeaturesCol(Array("feature1", "feature2", 
"feature3")).fit(df2)
+    }
+
+    // Should fail due to wrong type in schema for single column of features
 
 Review comment:
   Thanks for review.  Updated the comments
   Actually that's expected. I mean only the names ("feature2", "feature3") 
passed into `setFeaturesCol(Array)` are wanted to use as multiple columns. But 
"features" is provided in "df2" schema, equal to the default value of the 
single column name (just like calling `setFeaturesCol("features")`). Then my 
current design supposes users are trying to use both single column and multiple 
columns, and does type check for both of them. As said above ,"features" now is 
used as single column, and should be "Vector" but actually "Int", so the test 
fails.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to