[
https://issues.apache.org/jira/browse/SPARK-23377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16361154#comment-16361154
]
Joseph K. Bradley commented on SPARK-23377:
-------------------------------------------
[~viirya]'s patch currently changes DefaultParamsWriter to save only the
explicitly set Param values. This means that loading a model into a newer
version of Spark could pick up different Param values if the defaults have
changed in the meantime.
In the original design of persistence (see [SPARK-6725]), the goal was to make
behavior exactly reproducible. This means that default Param values do need to
be saved. I recommend that we maintain this guarantee.
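To make the reproducibility concern concrete, here is a small self-contained sketch (hypothetical names and a plain name -> value map, not Spark's real JSON metadata format): any Param that is not written out falls back to whatever default the *loading* version ships with.

```scala
// Sketch only: hypothetical stand-in for Param persistence, not Spark's real API.
// A model's metadata records only the explicitly set Params.
case class SavedMetadata(explicitlySet: Map[String, Any])

// Loading merges the saved values over the *current* version's defaults,
// so any Param left at its default silently tracks the new default.
def load(meta: SavedMetadata, currentDefaults: Map[String, Any]): Map[String, Any] =
  currentDefaults ++ meta.explicitlySet

// handleInvalid was never explicitly set when the model was saved.
val meta = SavedMetadata(Map("splits" -> Seq(0.0, 10.0, 100.0)))

val inOldSpark = load(meta, Map("handleInvalid" -> "error")) // default at save time
val inNewSpark = load(meta, Map("handleInvalid" -> "skip"))  // default changed later
// Same saved model, different behavior: exactly what saving defaults prevents.
```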
I can see a couple of possibilities:
1. Simplest: Change Bucketizer's loading logic so that it handles this edge
case (by removing the value for inputCol when inputCols is set). This may be
best for Spark 2.3 since it's the fastest fix.
2. Reasonable: Change Bucketizer's saving logic to handle this case. This is
the best balance between fixing the edge case and being quick to implement.
3. Largest: Change DefaultParamsWriter to store explicitly set values
separately from default values, then update Bucketizer's loading logic to use
that distinction. I'm not a fan of this approach since it would push a large
change into branch-2.3 during late QA.
I'd vote strongly for the 2nd option. Opinions?
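For the record, the 2nd option could look roughly like this. This is a sketch under assumed names, treating the persisted Params as a plain name -> value map rather than DefaultParamsWriter's real JSON metadata: when a multi-column Param is set, strip its mutually exclusive single-column counterpart before writing.

```scala
// Sketch of option 2 (hypothetical representation, not DefaultParamsWriter's API):
// before persisting, drop each single-column Param whenever its multi-column
// counterpart is set, so stale defaults never reach the saved metadata.
def filterBucketizerParams(params: Map[String, Any]): Map[String, Any] = {
  val exclusions = Seq(
    "inputCols"  -> "inputCol",
    "outputCols" -> "outputCol"
  )
  exclusions.foldLeft(params) { case (acc, (multi, single)) =>
    if (acc.contains(multi)) acc - single else acc
  }
}

val saved = filterBucketizerParams(Map(
  "inputCols" -> Seq("foo1", "foo2"),
  "inputCol"  -> "input",   // stale default; must not be persisted
  "splits"    -> Seq(0.0, 10.0)
))
// saved keeps "inputCols" and "splits" but no longer contains "inputCol"
```

A single-column Bucketizer (no inputCols set) passes through unchanged, so existing save/load behavior is untouched.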
> Bucketizer with multiple columns persistence bug
> ------------------------------------------------
>
> Key: SPARK-23377
> URL: https://issues.apache.org/jira/browse/SPARK-23377
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 2.3.0
> Reporter: Bago Amirbekian
> Priority: Critical
>
> A Bucketizer with multiple input/output columns gets "inputCol" set to the
> default value on write -> read, which causes it to throw an error on
> transform. Here's an example.
> {code:java}
> import org.apache.spark.ml.feature._
>
> val splits = Array(Double.NegativeInfinity, 0, 10, 100, Double.PositiveInfinity)
> val bucketizer = new Bucketizer()
>   .setSplitsArray(Array(splits, splits))
>   .setInputCols(Array("foo1", "foo2"))
>   .setOutputCols(Array("bar1", "bar2"))
>
> val data = Seq((1.0, 2.0), (10.0, 100.0), (101.0, -1.0)).toDF("foo1", "foo2")
> bucketizer.transform(data)
>
> val path = "/temp/bucketrizer-persist-test"
> bucketizer.write.overwrite.save(path)
> val bucketizerAfterRead = Bucketizer.read.load(path)
> println(bucketizerAfterRead.isDefined(bucketizerAfterRead.outputCol))
> // This line throws an error because "outputCol" is set
> bucketizerAfterRead.transform(data)
> {code}
> And the trace:
> {code:java}
> java.lang.IllegalArgumentException: Bucketizer bucketizer_6f0acc3341f7 has the inputCols Param set for multi-column transform. The following Params are not applicable and should not be set: outputCol.
>   at org.apache.spark.ml.param.ParamValidators$.checkExclusiveParams$1(params.scala:300)
>   at org.apache.spark.ml.param.ParamValidators$.checkSingleVsMultiColumnParams(params.scala:314)
>   at org.apache.spark.ml.feature.Bucketizer.transformSchema(Bucketizer.scala:189)
>   at org.apache.spark.ml.feature.Bucketizer.transform(Bucketizer.scala:141)
>   at line251821108a8a433da484ee31f166c83725.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-6079631:17)
> {code}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)