GitHub user viirya opened a pull request:
https://github.com/apache/spark/pull/20594
[SPARK-23377][ML] Fixes Bucketizer with multiple columns persistence bug
## What changes were proposed in this pull request?
#### Problem:
Since 2.3, `Bucketizer` supports multiple input/output columns. During
transformation, we check whether mutually exclusive params are set. E.g., if
`inputCols` and `outputCol` are both set, an error is thrown.
However, when a `Bucketizer` is written, the default params and
user-supplied params appear to be merged. All saved params are then loaded
back and set on the newly created model instance. As a result, the default
`outputCol` param from the `HasOutputCol` trait is set in `paramMap` and
becomes a user-supplied param, which makes the exclusive-params check fail.
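To illustrate the mechanism, here is a minimal toy model in plain Python. It is not Spark's code: `ToyStage`, its method names, and the default value `toy_default_output` are all hypothetical, chosen only to mirror the description above (separate default and user-supplied maps, merged on save, all re-set on load).

```python
class ToyStage:
    """Hypothetical stand-in for a stage with default and user-supplied params."""

    def __init__(self):
        # Assumed default, standing in for the one inherited from HasOutputCol.
        self.default_params = {"outputCol": "toy_default_output"}
        self.user_params = {}

    def set(self, name, value):
        # Anything passed through set() counts as user-supplied.
        self.user_params[name] = value
        return self

    def is_set(self, name):
        return name in self.user_params

    def check_exclusive_params(self):
        if self.is_set("inputCols") and self.is_set("outputCol"):
            raise ValueError("inputCols and outputCol are both set")

    def save(self):
        # Defaults and user-supplied params are merged into one map.
        return {**self.default_params, **self.user_params}

    @classmethod
    def load(cls, saved):
        # Every saved param is set back, so former defaults become user-supplied.
        stage = cls()
        for name, value in saved.items():
            stage.set(name, value)
        return stage


original = ToyStage().set("inputCols", ["a", "b"]).set("outputCols", ["x", "y"])
original.check_exclusive_params()  # passes: outputCol is only a default here

reloaded = ToyStage.load(original.save())
try:
    reloaded.check_exclusive_params()
    reload_error = None
except ValueError as exc:
    reload_error = str(exc)

print(reload_error)
```

In this model the stage passes the check before saving and fails it after a save/load round trip, even though the user never set `outputCol`.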
#### Fix:
This changes the saving logic of `Bucketizer` to handle this case. This is a
quick fix to make it in time for 2.3. We should consider modifying the
persistence mechanism later.
Please see the discussion in the JIRA.
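As a rough sketch of the idea behind the fix (drop the default `outputCol` from what gets saved when `inputCols` is set), the following plain-Python function models the merge step only. The function name `params_to_save` and the sample values are hypothetical, and the actual change in the patch may be structured differently.

```python
def params_to_save(default_params, user_params):
    """Merge defaults and user-supplied params, skipping a defaulted outputCol
    when the multi-column param inputCols was supplied by the user."""
    merged = {**default_params, **user_params}
    if "inputCols" in user_params and "outputCol" not in user_params:
        # Only the default value is dropped; a user-supplied outputCol is kept.
        merged.pop("outputCol", None)
    return merged


defaults = {"outputCol": "toy_default_output"}  # assumed default value

multi = params_to_save(defaults, {"inputCols": ["a", "b"], "outputCols": ["x", "y"]})
single = params_to_save(defaults, {"inputCol": "a"})

print(sorted(multi))   # ['inputCols', 'outputCols']
print(sorted(single))  # ['inputCol', 'outputCol']
```

With the default left out of the saved map, loading no longer turns it into a user-supplied param, so the exclusive-params check sees only what the user actually set.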
Note: The multi-column `QuantileDiscretizer` also has the same issue.
## How was this patch tested?
Modified tests.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/viirya/spark-1 SPARK-23377-2
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/20594.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #20594
----
commit 9cd7c86fad04c814b2c8f5547583122ba12c359b
Author: Liang-Chi Hsieh <viirya@...>
Date: 2018-02-13T03:51:41Z
Remove outputCol default value if inputCols is set.
----