Liang-Chi Hsieh commented on SPARK-23377:

I have no objection to [~josephkb]'s proposal (first the 2nd and later the 3rd).


The design question is whether we should keep the default values of the 
original Spark when saving the model, or use the default values of the current 
Spark when loading the model. Keeping the default values of the original Spark 
makes the behavior of saved models reproducible. However, the behavior of 
loaded models can then differ from that of models created with the current 
Spark. E.g., a model "foo" saved in 2.1 with default value "a" behaves 
reproducibly when loaded back into 2.3, but it behaves differently from the 
same "foo" model created in 2.3 if the default value has changed to "b".


In other words, one option is to keep a model's behavior consistent before and 
after persistence, even across Spark versions. The other is to let models of 
the same kind behave consistently even when they come from different Spark 
versions.
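
To make the two options concrete, here is a minimal sketch of the "foo" 
example. It is a simplified, hypothetical model of param persistence and not 
Spark's actual ML writer/reader: the names DefaultParamSketch, SavedModel, 
save, resolveFoo and the keepDefaults flag are all made up for illustration.

```scala
// Hypothetical sketch only; this does not use or mirror Spark's real API.
object DefaultParamSketch {
  // Default value of the made-up "foo" param in each Spark version.
  val defaultsByVersion: Map[String, String] =
    Map("2.1" -> "a", "2.3" -> "b")

  // What ends up on disk: params the user set explicitly, plus an
  // optional snapshot of the defaults at save time.
  final case class SavedModel(
      sparkVersion: String,
      userSet: Map[String, String],
      savedDefaults: Map[String, String])

  // keepDefaults = true  -> option 1: persist the saving version's defaults.
  // keepDefaults = false -> option 2: persist only user-set params.
  def save(
      version: String,
      userSet: Map[String, String],
      keepDefaults: Boolean): SavedModel = {
    val defaults =
      if (keepDefaults) Map("foo" -> defaultsByVersion(version))
      else Map.empty[String, String]
    SavedModel(version, userSet, defaults)
  }

  // Resolve "foo" after loading: a user-set value wins, then a saved
  // default, and finally the default of the loading Spark version.
  def resolveFoo(saved: SavedModel, loadingVersion: String): String =
    saved.userSet.get("foo")
      .orElse(saved.savedDefaults.get("foo"))
      .getOrElse(defaultsByVersion(loadingVersion))
}
```

Under these assumptions, a 2.1 model saved with keepDefaults = true resolves 
"foo" to "a" when loaded in 2.3 (reproducible, but different from a fresh 2.3 
model), while the same model saved with keepDefaults = false resolves to "b" 
(consistent with fresh 2.3 models, but different from its 2.1 behavior).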


Currently my patch follows the latter. I think users should take note of 
changes to default values in the upgraded Spark if they want to use old models. 
Btw, I can also think of a rare but possible situation: if we remove a default 
value that an old version had, the old models may not be easily loaded into the 
new Spark.



> Bucketizer with multiple columns persistence bug
> ------------------------------------------------
>                 Key: SPARK-23377
>                 URL: https://issues.apache.org/jira/browse/SPARK-23377
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.3.0
>            Reporter: Bago Amirbekian
>            Priority: Critical
> A Bucketizer with multiple input/output columns gets "inputCol" set to the 
> default value on write -> read, which causes it to throw an error on 
> transform. Here's an example.
> {code:java}
> import org.apache.spark.ml.feature._
> val splits = Array(Double.NegativeInfinity, 0, 10, 100, 
> Double.PositiveInfinity)
> val bucketizer = new Bucketizer()
>   .setSplitsArray(Array(splits, splits))
>   .setInputCols(Array("foo1", "foo2"))
>   .setOutputCols(Array("bar1", "bar2"))
> val data = Seq((1.0, 2.0), (10.0, 100.0), (101.0, -1.0)).toDF("foo1", "foo2")
> bucketizer.transform(data)
> val path = "/temp/bucketrizer-persist-test"
> bucketizer.write.overwrite.save(path)
> val bucketizerAfterRead = Bucketizer.read.load(path)
> println(bucketizerAfterRead.isDefined(bucketizerAfterRead.outputCol))
> // This line throws an error because "outputCol" is set
> bucketizerAfterRead.transform(data)
> {code}
> And the trace:
> {code:java}
> java.lang.IllegalArgumentException: Bucketizer bucketizer_6f0acc3341f7 has 
> the inputCols Param set for multi-column transform. The following Params are 
> not applicable and should not be set: outputCol.
>       at org.apache.spark.ml.param.ParamValidators$.checkExclusiveParams$1(params.scala:300)
>       at org.apache.spark.ml.param.ParamValidators$.checkSingleVsMultiColumnParams(params.scala:314)
>       at org.apache.spark.ml.feature.Bucketizer.transformSchema(Bucketizer.scala:189)
>       at org.apache.spark.ml.feature.Bucketizer.transform(Bucketizer.scala:141)
>       at line251821108a8a433da484ee31f166c83725.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-6079631:17)
> {code}
