Yanbo Liang created SPARK-16750:
-----------------------------------
Summary: ML GaussianMixture training failed due to feature column
type mistake
Key: SPARK-16750
URL: https://issues.apache.org/jira/browse/SPARK-16750
Project: Spark
Issue Type: Bug
Components: ML
Reporter: Yanbo Liang
Assignee: Yanbo Liang
ML GaussianMixture training fails because the feature column type is validated
incorrectly: the expected type should be {{ml.linalg.VectorUDT}}, but
{{mllib.linalg.VectorUDT}} is checked by mistake.
This bug is easy to reproduce with the following code:
{code}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.GaussianMixture
import org.apache.spark.ml.feature.MinMaxScaler
import org.apache.spark.ml.linalg.Vectors

val df = spark.createDataFrame(Seq(
  (1, Vectors.dense(0.0, 1.0, 4.0)),
  (2, Vectors.dense(1.0, 0.0, 4.0)),
  (3, Vectors.dense(1.0, 0.0, 5.0)),
  (4, Vectors.dense(0.0, 0.0, 5.0))
)).toDF("id", "features")

val scaler = new MinMaxScaler()
  .setInputCol("features")
  .setOutputCol("features_scaled")
  .setMin(0.0)
  .setMax(5.0)

val gmm = new GaussianMixture()
  .setFeaturesCol("features_scaled")
  .setK(2)

val pipeline = new Pipeline().setStages(Array(scaler, gmm))
pipeline.fit(df)
{code}

Running {{pipeline.fit(df)}} fails with:

{code}
java.lang.IllegalArgumentException: requirement failed: Column features_scaled
must be of type org.apache.spark.mllib.linalg.VectorUDT@f71b0bce but was
actually org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7.
  at scala.Predef$.require(Predef.scala:224)
  at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
  at org.apache.spark.ml.clustering.GaussianMixtureParams$class.validateAndTransformSchema(GaussianMixture.scala:64)
  at org.apache.spark.ml.clustering.GaussianMixture.validateAndTransformSchema(GaussianMixture.scala:275)
  at org.apache.spark.ml.clustering.GaussianMixture.transformSchema(GaussianMixture.scala:342)
  at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:180)
  at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:180)
  at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
  at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
  at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
  at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:180)
  at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:70)
  at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:132)
{code}
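
The error message points at the schema check in
{{GaussianMixtureParams.validateAndTransformSchema}}: it compares the features
column against the legacy {{mllib.linalg.VectorUDT}} instead of the new
{{ml.linalg.VectorUDT}}. A sketch of the corrected check (illustrative only;
the actual patch may differ):

{code}
// In validateAndTransformSchema (sketch, not the exact patch):
// Before (wrong): checks the legacy mllib vector type
// SchemaUtils.checkColumnType(schema, $(featuresCol), new org.apache.spark.mllib.linalg.VectorUDT)
// After: check the new ml.linalg vector type that ml.feature transformers produce
SchemaUtils.checkColumnType(schema, $(featuresCol), new org.apache.spark.ml.linalg.VectorUDT)
{code}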
The reason this bug was not caught by unit tests is that some
estimators/transformers do not call {{transformSchema(dataset.schema)}} as the
first step of {{fit}} or {{transform}}, so the schema check never ran. I added
that call to all estimators/transformers that were missing it.
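
The pattern those estimators/transformers should follow looks roughly like this
(simplified sketch; real method bodies and signatures differ):

{code}
// Sketch of the convention enforced by this fix: validate the input schema
// eagerly, before any training work starts, so type mismatches fail fast.
override def fit(dataset: Dataset[_]): GaussianMixtureModel = {
  transformSchema(dataset.schema, logging = true)  // fails fast on a wrong column type
  // ... actual training logic follows ...
}
{code}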
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]