Bago Amirbekian created SPARK-23377:
---------------------------------------

             Summary: Bucketizer with multiple columns persistence bug
                 Key: SPARK-23377
                 URL: https://issues.apache.org/jira/browse/SPARK-23377
             Project: Spark
          Issue Type: Bug
          Components: ML
    Affects Versions: 2.3.0
            Reporter: Bago Amirbekian


A Bucketizer with multiple input/output columns gets its single-column "outputCol" param explicitly set to the default value on write -> read, which causes it to throw an error on transform because both the single-column and multi-column output params then appear set. Here's an example.


{code:java}
import org.apache.spark.ml.feature._

val splits = Array(Double.NegativeInfinity, 0, 10, 100, Double.PositiveInfinity)
val bucketizer = new Bucketizer()
  .setSplitsArray(Array(splits, splits))
  .setInputCols(Array("foo1", "foo2"))
  .setOutputCols(Array("bar1", "bar2"))

val data = Seq((1.0, 2.0), (10.0, 100.0), (101.0, -1.0)).toDF("foo1", "foo2")
bucketizer.transform(data)

val path = "/temp/bucketizer-persist-test"
bucketizer.write.overwrite.save(path)
val bucketizerAfterRead = Bucketizer.read.load(path)
// Prints "true": the single-column "outputCol" param was set during write -> read
println(bucketizerAfterRead.isDefined(bucketizerAfterRead.outputCol))
// This line throws an error because "outputCol" is set alongside "outputCols"
bucketizerAfterRead.transform(data)
{code}
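
A possible workaround until this is fixed (a sketch, untested, assuming the failure is caused by the single-column "outputCol" param being explicitly set during read): clear that param on the loaded instance before transforming, using the public {{Params.clear}} method.

{code:java}
// Workaround sketch: after loading, clear the erroneously-set
// single-column param so only the multi-column params
// ("inputCols"/"outputCols") remain set when transform is called.
val bucketizerAfterRead = Bucketizer.read.load(path)
bucketizerAfterRead.clear(bucketizerAfterRead.outputCol)
bucketizerAfterRead.transform(data)
{code}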



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
