[ https://issues.apache.org/jira/browse/SPARK-16750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-16750:
------------------------------------

    Assignee: Apache Spark  (was: Yanbo Liang)

> ML GaussianMixture training failed due to feature column type mistake
> ---------------------------------------------------------------------
>
>                 Key: SPARK-16750
>                 URL: https://issues.apache.org/jira/browse/SPARK-16750
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>            Reporter: Yanbo Liang
>            Assignee: Apache Spark
>
> ML GaussianMixture training fails because of a feature column type mistake. The
> feature column type should be {{ml.linalg.VectorUDT}}, but
> {{mllib.linalg.VectorUDT}} was used by mistake.
> The bug is easy to reproduce with the following code:
> {code}
> import org.apache.spark.ml.Pipeline
> import org.apache.spark.ml.clustering.GaussianMixture
> import org.apache.spark.ml.feature.MinMaxScaler
> import org.apache.spark.ml.linalg.Vectors
>
> // `spark` is the SparkSession provided by spark-shell.
> val df = spark.createDataFrame(
>   Seq(
>     (1, Vectors.dense(0.0, 1.0, 4.0)),
>     (2, Vectors.dense(1.0, 0.0, 4.0)),
>     (3, Vectors.dense(1.0, 0.0, 5.0)),
>     (4, Vectors.dense(0.0, 0.0, 5.0)))
> ).toDF("id", "features")
> val scaler = new MinMaxScaler()
>   .setInputCol("features")
>   .setOutputCol("features_scaled")
>   .setMin(0.0)
>   .setMax(5.0)
> val gmm = new GaussianMixture()
>   .setFeaturesCol("features_scaled")
>   .setK(2)
> val pipeline = new Pipeline().setStages(Array(scaler, gmm))
> pipeline.fit(df)
>
> requirement failed: Column features_scaled must be of type org.apache.spark.mllib.linalg.VectorUDT@f71b0bce but was actually org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7.
> java.lang.IllegalArgumentException: requirement failed: Column features_scaled must be of type org.apache.spark.mllib.linalg.VectorUDT@f71b0bce but was actually org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7.
>       at scala.Predef$.require(Predef.scala:224)
>       at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
>       at org.apache.spark.ml.clustering.GaussianMixtureParams$class.validateAndTransformSchema(GaussianMixture.scala:64)
>       at org.apache.spark.ml.clustering.GaussianMixture.validateAndTransformSchema(GaussianMixture.scala:275)
>       at org.apache.spark.ml.clustering.GaussianMixture.transformSchema(GaussianMixture.scala:342)
>       at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:180)
>       at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:180)
>       at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
>       at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
>       at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
>       at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:180)
>       at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:70)
>       at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:132)
> {code}
> Why did the unit tests not catch this error? Because some
> estimators/transformers do not call {{transformSchema(dataset.schema)}}
> first during {{fit}} or {{transform}}. I will also add this call to all
> estimators/transformers that are missing it.
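
The point about unit tests can be illustrated with a small model in plain Scala. This is only a sketch under stated assumptions, not Spark code: `Column`, `Stage`, `LenientEstimator`, `StrictEstimator` and `pipelineFit` are hypothetical stand-ins invented for illustration, and the type names are plain strings. It shows how an estimator whose `fit` skips `transformSchema` hides a wrong expected column type when `fit` is called directly, while a pipeline that validates schemas up front (as the stack trace above suggests `Pipeline.fit` does) exposes it.

```scala
// Toy model only: every type here is a hypothetical stand-in, not a Spark class.
object SchemaCheckSketch {
  final case class Column(name: String, dataType: String)
  type Schema = Seq[Column]

  trait Stage {
    // Validates the input schema and returns the output schema.
    def transformSchema(schema: Schema): Schema
    // "Trains" on data with the given schema and returns a dummy model.
    def fit(schema: Schema): String
  }

  // Expects the wrong vector type, mirroring the mistake in the report.
  class LenientEstimator(featuresCol: String) extends Stage {
    protected val expectedType = "mllib.VectorUDT"

    def transformSchema(schema: Schema): Schema = {
      val actual = schema.find(_.name == featuresCol).map(_.dataType)
      require(actual.contains(expectedType),
        s"Column $featuresCol must be of type $expectedType " +
          s"but was actually ${actual.getOrElse("missing")}.")
      schema :+ Column("prediction", "int")
    }

    // Skips validation, so a direct call to fit never sees the mismatch.
    def fit(schema: Schema): String = "model"
  }

  // Same estimator, except that fit validates the schema before training.
  class StrictEstimator(featuresCol: String) extends LenientEstimator(featuresCol) {
    override def fit(schema: Schema): String = {
      transformSchema(schema)
      super.fit(schema)
    }
  }

  // Validates the schema through every stage first, then fits each stage.
  def pipelineFit(stages: Seq[Stage], schema: Schema): Seq[String] = {
    stages.foldLeft(schema)((current, stage) => stage.transformSchema(current))
    stages.map(_.fit(schema))
  }
}
```

With a schema whose `features_scaled` column has type `"ml.VectorUDT"`, calling `fit` on `LenientEstimator` directly returns a model, whereas `StrictEstimator.fit` and `pipelineFit` both throw an `IllegalArgumentException` from `require`. In this model, a test that only calls `fit` on the lenient estimator would pass despite the wrong expected type.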