GitHub user viirya opened a pull request:

    https://github.com/apache/spark/pull/20594

    [SPARK-23377][ML] Fixes Bucketizer with multiple columns persistence bug

    ## What changes were proposed in this pull request?
    
    #### Problem:
    
    Since 2.3, `Bucketizer` supports multiple input/output columns. During 
transformation, we check whether mutually exclusive params are set. E.g., if 
`inputCols` and `outputCol` are both set, an error is thrown.
    
    However, when we write a `Bucketizer`, it looks like the default params and 
user-supplied params are merged during writing. All saved params are then 
loaded back and set on the created model instance. So the default `outputCol` 
param from the `HasOutputCol` trait ends up in `paramMap` and becomes a 
user-supplied param. That makes the check of exclusive params fail.
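
The failure mode can be sketched with a toy model in plain Python. This is not Spark's persistence code: `ToyParams`, `save`, `load`, `check_exclusive` and the placeholder default value are hypothetical and only mimic the default/user-supplied split described above.

```python
class ToyParams:
    """Toy stand-in for a params holder; not Spark's actual API."""

    def __init__(self):
        # Defaults are kept apart from user-supplied values.
        # The default value here is a made-up placeholder.
        self.default_map = {"outputCol": "placeholder_output"}
        self.param_map = {}

    def set(self, name, value):
        self.param_map[name] = value
        return self

    def is_set(self, name):
        # True only for user-supplied params, never for defaults.
        return name in self.param_map

    def check_exclusive(self):
        if self.is_set("inputCols") and self.is_set("outputCol"):
            raise ValueError("inputCols and outputCol are both set")


def save(stage):
    # Defaults and user-supplied values are merged into one flat record.
    return {**stage.default_map, **stage.param_map}


def load(saved):
    # Every saved entry is set back as if the user had supplied it.
    stage = ToyParams()
    for name, value in saved.items():
        stage.set(name, value)
    return stage


original = ToyParams().set("inputCols", ["a", "b"]).set("outputCols", ["x", "y"])
original.check_exclusive()  # passes: outputCol is only a default here

reloaded = load(save(original))
try:
    reloaded.check_exclusive()
    roundtrip_ok = True
except ValueError:
    # The default outputCol came back as a user-supplied param.
    roundtrip_ok = False
```

In this toy model the original instance passes the check, while the reloaded one fails it, because the save/load round trip loses the distinction between defaults and user-supplied values.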
    
    #### Fix:
    
    This changes the saving logic of `Bucketizer` to handle this case. This is a 
quick fix to make it in time for 2.3. We should consider modifying the 
persistence mechanism later.
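
As a rough illustration of the idea only (the commit message reads "Remove outputCol default value if inputCols is set"), the saving step could be modelled as below. This is a plain-Python sketch, not the Scala change in this patch; `params_to_save` and the dictionary contents are hypothetical.

```python
def params_to_save(default_params, user_params):
    """Merge params for saving, dropping the outputCol default in multi-column mode."""
    defaults = dict(default_params)  # copy so the caller's dict is untouched
    if "inputCols" in user_params:
        # Multi-column mode: do not persist the single-column default.
        defaults.pop("outputCol", None)
    return {**defaults, **user_params}


# Illustrative values only; the outputCol default is a made-up placeholder.
defaults = {"outputCol": "placeholder_output", "handleInvalid": "error"}

multi = params_to_save(defaults, {"inputCols": ["a", "b"], "outputCols": ["x", "y"]})
single = params_to_save(defaults, {"inputCol": "a"})
```

With this sketch, the multi-column record no longer carries `outputCol`, so reloading it cannot trip the exclusive-param check, while the single-column record is saved as before.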
    
    Please see the discussion in the JIRA.
    
    Note: The multi-column `QuantileDiscretizer` also has the same issue.
    
    ## How was this patch tested?
    
    Modified tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 SPARK-23377-2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20594.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20594
    
----
commit 9cd7c86fad04c814b2c8f5547583122ba12c359b
Author: Liang-Chi Hsieh <viirya@...>
Date:   2018-02-13T03:51:41Z

    Remove outputCol default value if inputCols is set.

----

