[ 
https://issues.apache.org/jira/browse/SPARK-16750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16750:
------------------------------------

    Assignee: Apache Spark  (was: Yanbo Liang)

> ML GaussianMixture training failed due to feature column type mistake
> ---------------------------------------------------------------------
>
>                 Key: SPARK-16750
>                 URL: https://issues.apache.org/jira/browse/SPARK-16750
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>            Reporter: Yanbo Liang
>            Assignee: Apache Spark
>
> ML GaussianMixture training fails because of a feature column type mistake. 
> Schema validation should require {{ml.linalg.VectorUDT}} for the feature 
> column, but it requires {{mllib.linalg.VectorUDT}} by mistake.
> The bug is easy to reproduce with the following code:
> {code}
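> import org.apache.spark.ml.Pipeline
> import org.apache.spark.ml.clustering.GaussianMixture
> import org.apache.spark.ml.feature.MinMaxScaler
> import org.apache.spark.ml.linalg.Vectors
>
> // `spark` is the SparkSession provided by spark-shell.
> val df = spark.createDataFrame(
>   Seq(
>     (1, Vectors.dense(0.0, 1.0, 4.0)),
>     (2, Vectors.dense(1.0, 0.0, 4.0)),
>     (3, Vectors.dense(1.0, 0.0, 5.0)),
>     (4, Vectors.dense(0.0, 0.0, 5.0)))
> ).toDF("id", "features")
> // The scaler outputs an ml.linalg vector column named "features_scaled".
> val scaler = new MinMaxScaler()
>   .setInputCol("features")
>   .setOutputCol("features_scaled")
>   .setMin(0.0)
>   .setMax(5.0)
> val gmm = new GaussianMixture()
>   .setFeaturesCol("features_scaled")
>   .setK(2)
> val pipeline = new Pipeline().setStages(Array(scaler, gmm))
> // Pipeline.fit validates the schema of every stage and fails at the GaussianMixture stage:
> pipeline.fit(df)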
> requirement failed: Column features_scaled must be of type org.apache.spark.mllib.linalg.VectorUDT@f71b0bce but was actually org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7.
> java.lang.IllegalArgumentException: requirement failed: Column features_scaled must be of type org.apache.spark.mllib.linalg.VectorUDT@f71b0bce but was actually org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7.
>       at scala.Predef$.require(Predef.scala:224)
>       at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
>       at org.apache.spark.ml.clustering.GaussianMixtureParams$class.validateAndTransformSchema(GaussianMixture.scala:64)
>       at org.apache.spark.ml.clustering.GaussianMixture.validateAndTransformSchema(GaussianMixture.scala:275)
>       at org.apache.spark.ml.clustering.GaussianMixture.transformSchema(GaussianMixture.scala:342)
>       at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:180)
>       at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:180)
>       at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
>       at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
>       at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
>       at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:180)
>       at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:70)
>       at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:132)
> {code}
> Why did the unit tests not catch this error? Because some 
> estimators/transformers do not call {{transformSchema(dataset.schema)}} 
> first during {{fit}} or {{transform}}. I will also add this call to all 
> estimators/transformers that are missing it.
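
The pattern described in the issue, validating the schema at the start of {{fit}}, could be sketched roughly as below. This is a toy model for illustration only: ColType, Schema, ToyEstimator and ToyGmm are hypothetical stand-ins and not Spark classes, and the sketch does not reproduce the actual patch. It only shows why calling transformSchema first lets a wrong type check fail in a plain fit call, which is what a unit test exercises, rather than only inside a Pipeline.

```scala
// Hypothetical stand-ins for illustration; none of these are Spark classes.
sealed trait ColType
case object NewVector extends ColType // plays the role of ml.linalg.VectorUDT
case object OldVector extends ColType // plays the role of mllib.linalg.VectorUDT

final case class Schema(columns: Map[String, ColType])

trait ToyEstimator {
  // Validates the input schema; throws IllegalArgumentException on a mismatch.
  def transformSchema(schema: Schema): Schema

  // Stand-in for the real training logic.
  protected def train(schema: Schema): String

  // Validate first, then train, so a wrong check cannot go unnoticed.
  final def fit(schema: Schema): String = {
    transformSchema(schema)
    train(schema)
  }
}

// `expected` is the column type this toy estimator checks for.
class ToyGmm(expected: ColType) extends ToyEstimator {
  private val featuresCol = "features_scaled"

  def transformSchema(schema: Schema): Schema = {
    val actual = schema.columns(featuresCol)
    require(actual == expected,
      s"Column $featuresCol must be of type $expected but was actually $actual.")
    schema
  }

  protected def train(schema: Schema): String = "model"
}

object Demo {
  def main(args: Array[String]): Unit = {
    val schema = Schema(Map("features_scaled" -> NewVector))
    // An estimator that checks for the right type trains normally.
    println(new ToyGmm(NewVector).fit(schema))
    // An estimator that checks for the wrong type fails directly in fit.
    println(scala.util.Try(new ToyGmm(OldVector).fit(schema)).isFailure)
  }
}
```

If fit skipped the transformSchema call, the second ToyGmm would train without complaint, and the wrong check would only surface once something else (here, a pipeline) validated the schema.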



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
