[
https://issues.apache.org/jira/browse/SPARK-23377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16361154#comment-16361154
]
Joseph K. Bradley commented on SPARK-23377:
-------------------------------------------
[~viirya]'s patch currently changes DefaultParamsWriter to save only the
explicitly set Param values. This means that loading a model into a newer
version of Spark could pick up different Param values if the defaults have
changed in the meantime.
In the original design of persistence (see [SPARK-6725]), the goal was to make
behavior exactly reproducible. This means that default Param values do need to
be saved. I recommend that we maintain this guarantee.
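To make the reproducibility concern concrete, here is a small self-contained sketch (hypothetical names and a plain name -> value map, not Spark's real JSON metadata format): any Param that is not written out falls back to whatever default the *loading* version ships with.

```scala
// Sketch only: hypothetical stand-in for Param persistence, not Spark's real API.
// A model's metadata records only the explicitly set Params.
case class SavedMetadata(explicitlySet: Map[String, Any])

// Loading merges the saved values over the *current* version's defaults,
// so any Param left at its default silently tracks the new default.
def load(meta: SavedMetadata, currentDefaults: Map[String, Any]): Map[String, Any] =
  currentDefaults ++ meta.explicitlySet

// handleInvalid was never explicitly set when the model was saved.
val meta = SavedMetadata(Map("splits" -> Seq(0.0, 10.0, 100.0)))

val inOldSpark = load(meta, Map("handleInvalid" -> "error")) // default at save time
val inNewSpark = load(meta, Map("handleInvalid" -> "skip"))  // default changed later
// Same saved model, different behavior: exactly what saving defaults prevents.
```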
I can see a couple of possibilities:
1. Simplest: Change Bucketizer's loading logic so that it handles this edge
case (by removing the value for inputCol when inputCols is set). This may be
best for Spark 2.3 since it's the fastest fix.
2. Reasonable: Change Bucketizer's saving logic to handle this case. This is
the best balance between fixing the edge case and being quick to implement.
3. Largest: Change DefaultParamsWriter to store explicitly set values
separately from default values, then update Bucketizer's loading logic to use
that distinction. I'm not a fan of this approach since it would push a large
change into branch-2.3 during late QA.
I'd vote strongly for the 2nd option. Opinions?
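For the record, the 2nd option could look roughly like this. This is a sketch under assumed names, treating the persisted Params as a plain name -> value map rather than DefaultParamsWriter's real JSON metadata: when a multi-column Param is set, strip its mutually exclusive single-column counterpart before writing.

```scala
// Sketch of option 2 (hypothetical representation, not DefaultParamsWriter's API):
// before persisting, drop each single-column Param whenever its multi-column
// counterpart is set, so stale defaults never reach the saved metadata.
def filterBucketizerParams(params: Map[String, Any]): Map[String, Any] = {
  val exclusions = Seq(
    "inputCols"  -> "inputCol",
    "outputCols" -> "outputCol"
  )
  exclusions.foldLeft(params) { case (acc, (multi, single)) =>
    if (acc.contains(multi)) acc - single else acc
  }
}

val saved = filterBucketizerParams(Map(
  "inputCols" -> Seq("foo1", "foo2"),
  "inputCol"  -> "input",   // stale default; must not be persisted
  "splits"    -> Seq(0.0, 10.0)
))
// saved keeps "inputCols" and "splits" but no longer contains "inputCol"
```

A single-column Bucketizer (no inputCols set) passes through unchanged, so existing save/load behavior is untouched.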
> Bucketizer with multiple columns persistence bug
> ------------------------------------------------
>
> Key: SPARK-23377
> URL: https://issues.apache.org/jira/browse/SPARK-23377
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 2.3.0
> Reporter: Bago Amirbekian
> Priority: Critical
>
> A Bucketizer with multiple input/output columns gets "inputCol" set to the
> default value on write -> read, which causes it to throw an error on
> transform. Here's an example.
> {code:java}
> import org.apache.spark.ml.feature._
>
> val splits = Array(Double.NegativeInfinity, 0, 10, 100, Double.PositiveInfinity)
> val bucketizer = new Bucketizer()
>   .setSplitsArray(Array(splits, splits))
>   .setInputCols(Array("foo1", "foo2"))
>   .setOutputCols(Array("bar1", "bar2"))
>
> val data = Seq((1.0, 2.0), (10.0, 100.0), (101.0, -1.0)).toDF("foo1", "foo2")
> bucketizer.transform(data)
>
> val path = "/temp/bucketrizer-persist-test"
> bucketizer.write.overwrite.save(path)
> val bucketizerAfterRead = Bucketizer.read.load(path)
> println(bucketizerAfterRead.isDefined(bucketizerAfterRead.outputCol))
> // This line throws an error because "outputCol" is set
> bucketizerAfterRead.transform(data)
> {code}
> And the trace:
> {code:java}
> java.lang.IllegalArgumentException: Bucketizer bucketizer_6f0acc3341f7 has the inputCols Param set for multi-column transform. The following Params are not applicable and should not be set: outputCol.
>   at org.apache.spark.ml.param.ParamValidators$.checkExclusiveParams$1(params.scala:300)
>   at org.apache.spark.ml.param.ParamValidators$.checkSingleVsMultiColumnParams(params.scala:314)
>   at org.apache.spark.ml.feature.Bucketizer.transformSchema(Bucketizer.scala:189)
>   at org.apache.spark.ml.feature.Bucketizer.transform(Bucketizer.scala:141)
>   at line251821108a8a433da484ee31f166c83725.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-6079631:17)
> {code}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)