Github user MarcKaminski commented on the issue:
https://github.com/apache/spark/pull/17819
Hello,
I found a bug that occurs when putting the new Bucketizer into a Pipeline
and calling fit on it.
Calling fit on a Pipeline calls transformSchema on each PipelineStage in it.
Therefore, the transformSchema [method of the
Bucketizer](https://github.com/viirya/spark-1/blob/f8dedd1c92a8c48358743626b99c2f2192bc09b1/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala#L146)
is called, and it only checks the params of the **conventional** single-column
Bucketizer (i.e. inputCol), even when inputCols is set instead.
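To make the failure mode concrete, here is a minimal sketch in plain Scala with no Spark dependency. It is only a simplified model: `SchemaFoldSketch`, `Stage`, `MultiColStage` and `validate` are hypothetical names, and a schema is reduced to a set of column names. It mirrors the fold over stages that is visible in the stack trace, and a stage that reads an unset single-column param.

```scala
// Simplified, hypothetical model of the schema check; these are NOT Spark's classes.
object SchemaFoldSketch {
  // A schema is modelled as just a set of column names.
  type Schema = Set[String]

  trait Stage {
    def transformSchema(schema: Schema): Schema
  }

  // Models a stage configured through its multi-column params only:
  // inputCol stays unset (None), yet transformSchema still reads it.
  final class MultiColStage(inputCol: Option[String], outputCols: Seq[String]) extends Stage {
    def transformSchema(schema: Schema): Schema = {
      val col = inputCol.getOrElse(
        throw new NoSuchElementException("Failed to find a default value for inputCol"))
      require(schema.contains(col), s"Column $col does not exist")
      schema ++ outputCols
    }
  }

  // Thread the schema through every stage in order, as a fold over the stages.
  def validate(stages: Seq[Stage], schema: Schema): Schema =
    stages.foldLeft(schema)((current, stage) => stage.transformSchema(current))

  def main(args: Array[String]): Unit = {
    val stage = new MultiColStage(inputCol = None, outputCols = Seq("f1_bu", "f2_bu"))
    val failed =
      try { validate(Seq(stage), Set("f1", "f2")); false }
      catch { case _: NoSuchElementException => true }
    println(s"schema validation failed: $failed")
  }
}
```

In this model, a transform path that never reads inputCol would succeed, which matches the observation that transform works while fit does not.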
Steps to reproduce:
```
import org.apache.spark.ml._
import org.apache.spark.ml.feature.Bucketizer

case class data(f1: Double, f2: Double)
val datArr = Array(data(0.5, 0.3), data(0.5, -0.4))
val df = spark.createDataFrame(datArr)

// Multi-column Bucketizer: only inputCols/outputCols/splitsArray are set
val bucket = new Bucketizer()
  .setInputCols(Array("f1", "f2"))
  .setOutputCols(Array("f1_bu", "f2_bu"))
  .setSplitsArray(Array(Array(-0.5, 0.0, 0.5), Array(-0.5, 0.0, 0.5)))

// Works
bucket.transform(df).show()

// Fails with NoSuchElementException
val pl = new Pipeline()
  .setStages(Array(bucket))
  .fit(df)
```
Exception thrown by last line:
```
java.util.NoSuchElementException: Failed to find a default value for inputCol
  at org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:691)
  at org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:691)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.ml.param.Params$class.getOrDefault(params.scala:690)
  at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:42)
  at org.apache.spark.ml.param.Params$class.$(params.scala:697)
  at org.apache.spark.ml.PipelineStage.$(Pipeline.scala:42)
  at org.apache.spark.ml.feature.Bucketizer.transformSchema(Bucketizer.scala:147)
  at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:184)
  at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:184)
  at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
  at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
  at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
  at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:184)
  at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
  at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:136)
  ... 55 elided
```
Since this has not been merged into master yet, could you fix this and add a
test for it?
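A regression check along these lines might cover it. This is only a sketch, not the project's test-suite style: the object name `BucketizerPipelineCheck` is hypothetical, and it assumes a local SparkSession on a Spark build that includes the multi-column Bucketizer setters used in the reproduction above.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.Bucketizer
import org.apache.spark.sql.SparkSession

// Hypothetical stand-alone check; assumes Spark ML is on the classpath.
object BucketizerPipelineCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("bucketizer-pipeline-check")
      .getOrCreate()
    try {
      // Same values as in the reproduction, built from tuples.
      val df = spark.createDataFrame(Seq((0.5, 0.3), (0.5, -0.4))).toDF("f1", "f2")

      val bucket = new Bucketizer()
        .setInputCols(Array("f1", "f2"))
        .setOutputCols(Array("f1_bu", "f2_bu"))
        .setSplitsArray(Array(Array(-0.5, 0.0, 0.5), Array(-0.5, 0.0, 0.5)))

      // fit must not throw when only the multi-column params are set.
      val model = new Pipeline().setStages(Array(bucket)).fit(df)
      val result = model.transform(df)

      // Both output columns should be present after the pipeline runs.
      assert(result.columns.contains("f1_bu"))
      assert(result.columns.contains("f2_bu"))
      assert(result.count() == 2)
    } finally {
      spark.stop()
    }
  }
}
```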
Thanks!