GitHub user jkbradley opened a pull request:

    https://github.com/apache/spark/pull/11965

    [SPARK-14159][ML] Fixed bug in StringIndexer + related issue in RFormula

    ## What changes were proposed in this pull request?
    
    StringIndexerModel.transform sets the name in the output column's metadata
    to inputCol. It should use outputCol instead. Fixing this causes a problem
    with the metadata produced by RFormula.
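
To make the naming issue concrete, here is a minimal plain-Python sketch (not Spark code). The dictionary layout is a simplified stand-in for ML attribute metadata, `nominal_metadata` and the column names are hypothetical, and it assumes the corrected attribute name is the output column's.

```python
def nominal_metadata(name, labels):
    """Simplified stand-in for the nominal attribute attached to a column."""
    return {"type": "nominal", "name": name, "vals": list(labels)}


# Hypothetical column names, chosen only for illustration.
input_col, output_col = "category", "categoryIndex"
labels = ["a", "b", "c"]

# Behavior described as the bug: the attribute is named after the input column.
before_fix = nominal_metadata(input_col, labels)
# Assumed behavior after the fix: the attribute is named after the output column.
after_fix = nominal_metadata(output_col, labels)

print(before_fix["name"], "->", after_fix["name"])
```

Anything downstream that matches attributes by name sees a different string once the name changes, which is presumably why RFormula needed a follow-up change.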
    
    Fix in RFormula: I added the StringIndexer columns to prefixesToRewrite,
    and I modified VectorAttributeRewriter to find and replace every occurrence
    of each "prefix", since an attribute name can collect multiple prefixes
    from StringIndexer + Interaction.
    
    Note that the name "prefixes" is no longer accurate, since strings in the
    middle of an attribute name may now be replaced.
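
The replace-all behavior can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the Scala code in the patch: `rewrite_attribute_name`, the temporary column names, and the `x:y` interaction naming are all hypothetical.

```python
def rewrite_attribute_name(name, prefixes_to_rewrite):
    """Replace every occurrence of each old string, not only a leading one."""
    for old, new in prefixes_to_rewrite.items():
        name = name.replace(old, new)
    return name


def rewrite_leading_prefix_only(name, prefixes_to_rewrite):
    """Contrast case: rewrite only a prefix found at the start of the name."""
    for old, new in prefixes_to_rewrite.items():
        if name.startswith(old):
            return new + name[len(old):]
    return name


# Hypothetical mapping from temporary indexer column names to user-facing terms.
prefixes = {"stridx_a1b2_": "a_", "stridx_c3d4_": "b_"}

# Hypothetical attribute name that picked up two prefixes via an interaction.
interaction_attr = "stridx_a1b2_x:stridx_c3d4_y"

print(rewrite_attribute_name(interaction_attr, prefixes))
print(rewrite_leading_prefix_only(interaction_attr, prefixes))
```

In this sketch the leading-prefix version leaves the second temporary name in place, while the replace-all version rewrites both.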
    
    ## How was this patch tested?
    
    Unit test which failed before this fix.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jkbradley/spark StringIndexer-fix

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/11965.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #11965
    
----
commit 5d0a0b3670bab564ff2991d24ebba96c3cb0bfbd
Author: Joseph K. Bradley <[email protected]>
Date:   2016-03-25T17:09:25Z

    Fixed bug in StringIndexer, and fixed problem caused by that fix in RFormula

commit 36bf7307338c5d0eba953f5ee53ee9e2d889db3a
Author: Joseph K. Bradley <[email protected]>
Date:   2016-03-25T20:42:58Z

    Added StringIndexer unit test which failed before fix

----


