GitHub user viirya opened a pull request:
https://github.com/apache/spark/pull/19527
[SPARK-13030][ML] Create OneHotEncoderEstimator for OneHotEncoder as Estimator
## What changes were proposed in this pull request?
This patch adds a new class, `OneHotEncoderEstimator`, which extends
`Estimator`. Its `fit` method returns a `OneHotEncoderModel`.
Methods shared by the existing `OneHotEncoder` and the new
`OneHotEncoderEstimator`, such as schema transformation, are extracted
into `OneHotEncoderCommon`.
### Multi-column support
`OneHotEncoderEstimator` adds simpler multi-column support because it is a
new API and is therefore free of backward-compatibility constraints.
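
To illustrate the fit-then-transform flow over several columns, here is a
minimal sketch in plain Python. It is not Spark code: the helper names
(`fit_category_sizes`, `one_hot`, `encode_row`) are hypothetical, and it
assumes that fitting learns each column's category count as the largest index
seen plus one, which this description does not state.

```python
# Illustrative sketch only: plain Python, not the Spark implementation.
# All function names below are hypothetical.

def fit_category_sizes(rows, input_cols):
    """'fit' step (assumed): category count per column = max index + 1."""
    return {col: max(int(row[col]) for row in rows) + 1 for col in input_cols}

def one_hot(index, size, drop_last=True):
    """Encode one category index as a dense list of floats.

    With drop_last=True the vector has size - 1 slots and the last
    category is represented by all zeros.
    """
    if not 0 <= index < size:
        raise ValueError(f"index {index} is outside [0, {size})")
    length = size - 1 if drop_last else size
    vec = [0.0] * length
    if index < length:
        vec[index] = 1.0
    return vec

def encode_row(row, sizes, drop_last=True):
    """'transform' step: encode every fitted column of a single row."""
    return {col: one_hot(int(row[col]), size, drop_last)
            for col, size in sizes.items()}

rows = [
    {"color": 0, "shape": 1},
    {"color": 2, "shape": 0},
    {"color": 1, "shape": 1},
]

# One fit call covers both columns; the result plays the role of the model.
sizes = fit_category_sizes(rows, ["color", "shape"])
encoded = [encode_row(r, sizes) for r in rows]
```

The point of the sketch is that a single fitted object holds the sizes for
all input columns, so one transform pass can encode them together.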
### handleInvalid Param support
`OneHotEncoderEstimator` supports the `handleInvalid` Param, which accepts
`error` and `skip`. Note that `skip` cannot be combined with `dropLast` set
to true, because the two settings conflict in the encoded vector.
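
One way to picture the conflict is the plain-Python sketch below. It is not
Spark code, the `encode` helper is hypothetical, and it rests on an
assumption this description does not spell out: that a skipped invalid value
is encoded as an all-zeros vector. Under that assumption the skipped value
is indistinguishable from the last category when `dropLast` is true.

```python
# Illustrative sketch only: plain Python, not the Spark implementation.
# ASSUMPTION: "skip" encodes an invalid value as an all-zeros vector.

def encode(index, size, drop_last, handle_invalid="error"):
    """Hypothetical helper: one-hot encode `index` among `size` categories."""
    length = size - 1 if drop_last else size
    vec = [0.0] * length
    if not 0 <= index < size:
        if handle_invalid == "error":
            raise ValueError(f"unseen category index {index}")
        return vec  # assumed "skip" behaviour: leave every slot at zero
    if index < length:
        vec[index] = 1.0
    return vec

size = 3  # categories 0, 1, 2 were seen during fitting

# With drop_last=True the last category (2) is also all zeros,
# so it collides with the skipped invalid value (5).
last_category = encode(2, size, drop_last=True, handle_invalid="skip")
invalid_value = encode(5, size, drop_last=True, handle_invalid="skip")
ambiguous = last_category == invalid_value

# With drop_last=False the last category keeps its own slot,
# so the two encodings stay distinct.
distinct = (encode(2, size, drop_last=False, handle_invalid="skip")
            != encode(5, size, drop_last=False, handle_invalid="skip"))
```

Here `ambiguous` and `distinct` both evaluate to `True`, which matches the
restriction described above.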
## How was this patch tested?
Added new test suite `OneHotEncoderEstimatorSuite`.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/viirya/spark-1 SPARK-13030
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19527.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19527
----
commit 8fd4677fd0e729d99d8777010e78bb5cfea3cf86
Author: Liang-Chi Hsieh <[email protected]>
Date: 2017-10-18T07:31:32Z
Add OneHotEncoderEstimator and related tests.
----