[
https://issues.apache.org/jira/browse/SPARK-5888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Xiangrui Meng resolved SPARK-5888.
----------------------------------
Resolution: Fixed
Fix Version/s: 1.4.0
Issue resolved by pull request 5500
[https://github.com/apache/spark/pull/5500]
> Add OneHotEncoder as a Transformer
> ----------------------------------
>
> Key: SPARK-5888
> URL: https://issues.apache.org/jira/browse/SPARK-5888
> Project: Spark
> Issue Type: Sub-task
> Components: ML
> Reporter: Xiangrui Meng
> Assignee: Sandy Ryza
> Fix For: 1.4.0
>
>
> `OneHotEncoder` takes a categorical column and outputs a vector column, which
> stores the category info as binary (0/1) indicators.
> {code}
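> import org.apache.spark.ml.feature.OneHotEncoder
> // Encode the indexed "countryIndex" column into the vector column "countries".
> val ohe = new OneHotEncoder()
>   .setInputCol("countryIndex")
>   .setOutputCol("countries")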
> {code}
> It should read the category info from the metadata and assign feature names
> properly in the output column. We need to discuss the default naming scheme
> and whether we should let it process multiple categorical columns at the same
> time.
> One category (the most frequent one) should be removed from the output to
> make the output columns linearly independent. Or this could be an option turned
> on by default.
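
For illustration only, the encoding described above can be sketched in plain Scala without any Spark dependency. This is not the code from pull request 5500: `OneHotSketch`, `encode`, `featureNames`, `dropIndex`, and the `outputCol_label` naming scheme are all hypothetical, and the sketch assumes the category to drop is identified by its index.

```scala
// Minimal sketch, plain Scala; every name here is hypothetical.
object OneHotSketch {

  /** Encode a category index as a 0/1 vector of length numCategories.
    * If dropIndex is set, that category's slot is removed from the output,
    * so the dropped category is represented by the all-zero vector. */
  def encode(index: Int, numCategories: Int,
             dropIndex: Option[Int] = None): Vector[Double] = {
    require(index >= 0 && index < numCategories, s"index $index out of range")
    val full = Vector.tabulate(numCategories)(i => if (i == index) 1.0 else 0.0)
    dropIndex match {
      case Some(d) => full.patch(d, Nil, 1) // remove one slot at position d
      case None    => full
    }
  }

  /** Derive output feature names from the category labels, using an assumed
    * "<outputCol>_<label>" scheme; the dropped category gets no name. */
  def featureNames(outputCol: String, labels: Seq[String],
                   dropIndex: Option[Int] = None): Seq[String] = {
    val names = labels.map(label => s"${outputCol}_$label")
    dropIndex match {
      case Some(d) => names.patch(d, Nil, 1)
      case None    => names
    }
  }
}
```

With three categories and `dropIndex = Some(0)`, `OneHotSketch.encode(1, 3, Some(0))` yields a two-element vector, and the dropped category maps to all zeros. Removing one slot is what keeps the output columns linearly independent when a model also fits an intercept.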