GitHub user jkbradley opened a pull request:
https://github.com/apache/spark/pull/20132
[SPARK-13030][ML] Follow-up cleanups for OneHotEncoderEstimator
## What changes were proposed in this pull request?
Follow-up cleanups for the OneHotEncoderEstimator PR. Some of the discussion
is in the original PR: https://github.com/apache/spark/pull/19527. This PR
includes:
* configedCategorySize: I reverted this to return an Array. I realized the
original setup (which I had recommended in the original PR) caused the whole
model to be serialized in the UDF.
* encoder: I reorganized the logic to show what I meant in my comment on
the previous PR. I think it's simpler, but I am open to suggestions.
I also made some small style cleanups based on IntelliJ warnings.
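
For readers unfamiliar with the serialization issue in the configedCategorySize item, here is a minimal sketch of the general effect. It is not the code from this PR. `ToyModel`, `lookupCapturingModel`, `lookupCapturingArray` and `serializedSize` are hypothetical names made up for illustration, and plain Java serialization stands in for what Spark does when it ships a UDF closure to executors. The sketch assumes Scala 2.12 or later, where function literals are serializable.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

object ClosureCaptureSketch {

  // Hypothetical stand-in for a fitted model: a small array of category
  // sizes plus a large payload representing everything else the model holds.
  class ToyModel(val categorySizes: Array[Int], val payload: Array[Byte])
      extends Serializable {

    // References a member directly, so the closure captures `this`
    // and the whole model travels with the function.
    def lookupCapturingModel: Int => Int =
      (col: Int) => categorySizes(col)

    // Copies the member into a local val first, so the closure captures
    // only the array and not the enclosing model.
    def lookupCapturingArray: Int => Int = {
      val localSizes = categorySizes
      (col: Int) => localSizes(col)
    }
  }

  // Size in bytes of `obj` under plain Java serialization.
  def serializedSize(obj: AnyRef): Int = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    try out.writeObject(obj) finally out.close()
    bytes.size()
  }

  def main(args: Array[String]): Unit = {
    // 1 MiB payload so the difference between the two closures is obvious.
    val model = new ToyModel(Array(3, 5, 2), new Array[Byte](1 << 20))
    println(s"capturing model: ${serializedSize(model.lookupCapturingModel)} bytes")
    println(s"capturing array: ${serializedSize(model.lookupCapturingArray)} bytes")
  }
}
```

Under these assumptions the first closure should serialize to more than the 1 MiB payload, while the second should stay far smaller, since it carries only the local array.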
## How was this patch tested?
Existing unit tests
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/jkbradley/spark viirya-SPARK-13030
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/20132.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #20132
----
commit 9bf045da1adeaa08deeb96eaa0289d8d4cb74bc1
Author: Joseph K. Bradley <joseph@...>
Date: 2017-12-31T23:25:45Z
updates for final PR
----