zhengruifeng created SPARK-22971: ------------------------------------ Summary: OneVsRestModel should use temporary RawPredictionCol Key: SPARK-22971 URL: https://issues.apache.org/jira/browse/SPARK-22971 Project: Spark Issue Type: Bug Components: ML Affects Versions: 2.3.0 Reporter: zhengruifeng Priority: Minor
Issue occurs when I transform one dataframe with two different classification models, first by a {{RandomForestClassificationModel}}, then a {{OneVsRestModel}}. The first transform generate a new colum "rawPrediction", which will be internally used in {{OneVsRestModel#transform}} and cause failure. {code} scala> val df = spark.read.format("libsvm").load("/Users/zrf/Dev/OpenSource/spark/data/mllib/sample_multiclass_classification_data.txt") 18/01/05 17:08:18 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException df: org.apache.spark.sql.DataFrame = [label: double, features: vector] scala> val rf = new RandomForestClassifier() rf: org.apache.spark.ml.classification.RandomForestClassifier = rfc_c11b1e1e1f7f scala> val rfm = rf.fit(df) rfm: org.apache.spark.ml.classification.RandomForestClassificationModel = RandomForestClassificationModel (uid=rfc_c11b1e1e1f7f) with 20 trees scala> val lr = new LogisticRegression().setMaxIter(1) lr: org.apache.spark.ml.classification.LogisticRegression = logreg_f5a5285eba06 scala> val ovr = new OneVsRest().setClassifier(lr) ovr: org.apache.spark.ml.classification.OneVsRest = oneVsRest_8f5584190634 scala> val ovrModel = ovr.fit(df) ovrModel: org.apache.spark.ml.classification.OneVsRestModel = oneVsRest_8f5584190634 scala> val df2 = rfm.setPredictionCol("rfPred").transform(df) df2: org.apache.spark.sql.DataFrame = [label: double, features: vector ... 3 more fields] scala> val df3 = ovrModel.setPredictionCol("ovrPred").transform(df2) java.lang.IllegalArgumentException: requirement failed: Column rawPrediction already exists. at scala.Predef$.require(Predef.scala:224) at org.apache.spark.ml.util.SchemaUtils$.appendColumn(SchemaUtils.scala:101) at org.apache.spark.ml.util.SchemaUtils$.appendColumn(SchemaUtils.scala:91) at org.apache.spark.ml.classification.ClassifierParams$class.validateAndTransformSchema(Classifier.scala:43) at org.apache.spark.ml.classification.ProbabilisticClassificationModel.org$apache$spark$ml$classification$ProbabilisticClassifierParams$$super$validateAndTransformSchema(ProbabilisticClassifier.scala:77) at org.apache.spark.ml.classification.ProbabilisticClassifierParams$class.validateAndTransformSchema(ProbabilisticClassifier.scala:37) at org.apache.spark.ml.classification.LogisticRegressionModel.org$apache$spark$ml$classification$LogisticRegressionParams$$super$validateAndTransformSchema(LogisticRegression.scala:904) at org.apache.spark.ml.classification.LogisticRegressionParams$class.validateAndTransformSchema(LogisticRegression.scala:265) at org.apache.spark.ml.classification.LogisticRegressionModel.validateAndTransformSchema(LogisticRegression.scala:904) at org.apache.spark.ml.PredictionModel.transformSchema(Predictor.scala:192) at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74) at org.apache.spark.ml.classification.ProbabilisticClassificationModel.transform(ProbabilisticClassifier.scala:104) at org.apache.spark.ml.classification.OneVsRestModel$$anonfun$7.apply(OneVsRest.scala:184) at org.apache.spark.ml.classification.OneVsRestModel$$anonfun$7.apply(OneVsRest.scala:173) at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57) at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66) at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186) at org.apache.spark.ml.classification.OneVsRestModel.transform(OneVsRest.scala:173) ... 50 elided {code} {{OneVsRestModel#transform}} only generates a new prediction column, and should not fail by other columns. -- This message was sent by Atlassian JIRA (v6.4.14#64029) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org