[GitHub] spark issue #17819: [SPARK-20542][ML][SQL] Add an API to Bucketizer that can...

MarcKaminski Thu, 14 Sep 2017 05:43:19 -0700

Github user MarcKaminski commented on the issue:

    https://github.com/apache/spark/pull/17819
  
    Hello, 
    
    I found a bug that occurs when the new Bucketizer (configured with inputCols) is put into a Pipeline and fit is called on it.
    Calling fit on a Pipeline calls transformSchema on each PipelineStage it contains.
    
    Therefore, the transformSchema [method of the Bucketizer](https://github.com/viirya/spark-1/blob/f8dedd1c92a8c48358743626b99c2f2192bc09b1/mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala#L146) is called. It checks only the params of the **conventional** Bucketizer (i.e. inputCol), which are not set when inputCols is used instead.
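    
    To illustrate the failure mode, here is a minimal, Spark-free sketch. `Params`, `requireInputCol`, `unguarded` and `guarded` are hypothetical stand-ins I made up for illustration, not Spark's actual classes, and the guarded variant is only one possible shape for a fix:
    
    ```scala
    // Simplified model of the problem; none of these names are Spark APIs.
    object TransformSchemaSketch {
      final case class Params(
          inputCol: Option[String] = None,
          inputCols: Option[Array[String]] = None) {
        // Mimics reading a param that has neither a value nor a default.
        def requireInputCol: String = inputCol.getOrElse(
          throw new NoSuchElementException(
            "Failed to find a default value for inputCol"))
      }

      // Behaviour described above: inputCol is read unconditionally.
      def unguarded(p: Params): Seq[String] = Seq(p.requireInputCol)

      // Possible guard: prefer inputCols when set, else fall back to inputCol.
      def guarded(p: Params): Seq[String] = p.inputCols match {
        case Some(cols) => cols.toSeq
        case None       => Seq(p.requireInputCol)
      }

      def main(args: Array[String]): Unit = {
        val multi = Params(inputCols = Some(Array("f1", "f2")))
        println(guarded(multi).mkString(", "))               // prints "f1, f2"
        println(scala.util.Try(unguarded(multi)).isFailure)  // prints "true"
      }
    }
    ```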
    
    Steps to reproduce: 
    ```
    import org.apache.spark.ml._
    import org.apache.spark.ml.feature.Bucketizer
    
    case class data(f1: Double, f2: Double) 
    val datArr = Array(data(0.5, 0.3), data(0.5, -0.4))
    val df = spark.createDataFrame(datArr)
    
    val bucket = new Bucketizer()
      .setInputCols(Array("f1", "f2"))
      .setOutputCols(Array("f1_bu", "f2_bu"))
      .setSplitsArray(Array(Array(-0.5, 0.0, 0.5), Array(-0.5, 0.0, 0.5)))
    
    // Works: transform is called directly on the Bucketizer
    bucket.transform(df).show()
    
    // Fails with java.util.NoSuchElementException (see below)
    val pl = new Pipeline()
      .setStages(Array(bucket))
      .fit(df)
    ```
    
    Exception thrown by the last line: 
    
    ```
    java.util.NoSuchElementException: Failed to find a default value for inputCol
      at org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:691)
      at org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:691)
      at scala.Option.getOrElse(Option.scala:121)
      at org.apache.spark.ml.param.Params$class.getOrDefault(params.scala:690)
      at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:42)
      at org.apache.spark.ml.param.Params$class.$(params.scala:697)
      at org.apache.spark.ml.PipelineStage.$(Pipeline.scala:42)
      at org.apache.spark.ml.feature.Bucketizer.transformSchema(Bucketizer.scala:147)
      at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:184)
      at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:184)
      at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
      at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
      at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
      at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:184)
      at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
      at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:136)
      ... 55 elided
    ```
    
    Since this has not yet been merged into master, would you still be able to fix this and add a test for it?
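    
    A regression check along these lines might cover it. This is only a sketch: the object name `BucketizerPipelineCheck` and the local SparkSession setup are my own assumptions, and it needs a Spark build that includes the inputCols/outputCols/splitsArray API from this PR:
    
    ```scala
    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.feature.Bucketizer
    import org.apache.spark.sql.SparkSession

    // Hypothetical standalone check, not part of Spark's test suite.
    object BucketizerPipelineCheck {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[2]")
          .appName("bucketizer-pipeline-check")
          .getOrCreate()
        try {
          // Same data as in the reproduction steps.
          val df = spark
            .createDataFrame(Seq((0.5, 0.3), (0.5, -0.4)))
            .toDF("f1", "f2")

          val bucket = new Bucketizer()
            .setInputCols(Array("f1", "f2"))
            .setOutputCols(Array("f1_bu", "f2_bu"))
            .setSplitsArray(Array(Array(-0.5, 0.0, 0.5), Array(-0.5, 0.0, 0.5)))

          // fit must not throw, and the fitted model must add both output columns.
          val model = new Pipeline().setStages(Array(bucket)).fit(df)
          val out = model.transform(df)
          assert(out.columns.contains("f1_bu"))
          assert(out.columns.contains("f2_bu"))
        } finally {
          spark.stop()
        }
      }
    }
    ```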
    Thanks!




