GitHub user jkbradley opened a pull request:
https://github.com/apache/spark/pull/11965
[SPARK-14159][ML] Fixed bug in StringIndexer + related issue in RFormula
## What changes were proposed in this pull request?
StringIndexerModel.transform sets the name in the output column's metadata to
inputCol. It should use the output column's name instead. Correcting this
breaks the metadata produced by RFormula, so that is fixed here as well.
Fix in RFormula: I added the StringIndexer columns to prefixesToRewrite and
modified VectorAttributeRewriter to find and replace every occurrence of each
"prefix", since an attribute name can accumulate multiple prefixes from
StringIndexer + Interaction.
Note that the name "prefixes" is no longer accurate, since the matched strings
may now be replaced in the interior of an attribute name, not only at its start.
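
The find-and-replace-all idea can be sketched in plain Scala, without Spark. This is only an illustration, not the code in the patch: the object `RewriteSketch`, the helper `rewriteName`, and the names `tmpA_`, `tmpB_`, and `a_x:b_y` are all hypothetical, and the `:` separator for interaction terms is an assumption made for the example.

```scala
object RewriteSketch {
  // Hypothetical helper: apply each (from -> to) rewrite to every occurrence
  // in the name, not only to a leading prefix.
  def rewriteName(name: String, rewrites: Seq[(String, String)]): String =
    rewrites.foldLeft(name) { case (cur, (from, to)) => cur.replace(from, to) }

  // Prefix-only variant for contrast: rewrites at most one leading match.
  def rewritePrefixOnly(name: String, rewrites: Seq[(String, String)]): String =
    rewrites.collectFirst {
      case (from, to) if name.startsWith(from) => to + name.stripPrefix(from)
    }.getOrElse(name)
}

// Assumed rewrites from temporary indexer column names to user-facing names.
val rewrites = Seq("tmpA_" -> "a_", "tmpB_" -> "b_")

// An interaction attribute carries two temporary names at once.
val interaction = "tmpA_x:tmpB_y"

val full = RewriteSketch.rewriteName(interaction, rewrites)
val prefixOnly = RewriteSketch.rewritePrefixOnly(interaction, rewrites)
```

With a prefix-only rewrite the second temporary name would survive inside the attribute name, which is why every occurrence has to be replaced once interactions combine several indexed columns.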
## How was this patch tested?
Added a unit test that failed before this fix.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/jkbradley/spark StringIndexer-fix
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/11965.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #11965
----
commit 5d0a0b3670bab564ff2991d24ebba96c3cb0bfbd
Author: Joseph K. Bradley <[email protected]>
Date: 2016-03-25T17:09:25Z
Fixed bug in StringIndexer, and fixed problem caused by that fix in RFormula
commit 36bf7307338c5d0eba953f5ee53ee9e2d889db3a
Author: Joseph K. Bradley <[email protected]>
Date: 2016-03-25T20:42:58Z
Added StringIndexer unit test which failed before fix
----