[
https://issues.apache.org/jira/browse/SPARK-5888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541339#comment-14541339
]
Sandy Ryza commented on SPARK-5888:
-----------------------------------
Hi [~hvanhovell], I agree that this should work. [~mengxr], any thoughts on
the best way to solve this?
> Add OneHotEncoder as a Transformer
> ----------------------------------
>
> Key: SPARK-5888
> URL: https://issues.apache.org/jira/browse/SPARK-5888
> Project: Spark
> Issue Type: Sub-task
> Components: ML
> Reporter: Xiangrui Meng
> Assignee: Sandy Ryza
> Fix For: 1.4.0
>
>
> `OneHotEncoder` takes a categorical column and outputs a vector column, which
> stores the category info as binary (0/1) values.
> {code}
> val ohe = new OneHotEncoder()
> .setInputCol("countryIndex")
> .setOutputCol("countries")
> {code}
> It should read the category info from the metadata and assign feature names
> properly in the output column. We need to discuss the default naming scheme
> and whether we should let it process multiple categorical columns at the same
> time.
> One category (the most frequent one) should be removed from the output to
> make the output columns linearly independent. Or this could be an option turned
> on by default.
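
To make the linear-independence point in the description concrete, here is a minimal sketch in plain Scala. It is not Spark's implementation: `OneHotSketch`, `encode`, and the `dropped` parameter are hypothetical names chosen for illustration, and the category to drop is passed in explicitly rather than read from column metadata.

```scala
// Hypothetical helper, not part of Spark's API: encodes a category index as a
// 0/1 vector, optionally omitting one reference category so that the output
// columns are linearly independent.
object OneHotSketch {
  def encode(index: Int, numCategories: Int, dropped: Option[Int]): Array[Double] = {
    require(index >= 0 && index < numCategories, s"index $index out of range")
    // Categories that keep a slot in the output vector.
    val kept = (0 until numCategories).filterNot(i => dropped.contains(i))
    kept.map(i => if (i == index) 1.0 else 0.0).toArray
  }

  def main(args: Array[String]): Unit = {
    // Three categories, with category 0 dropped as the reference level.
    println(encode(1, 3, Some(0)).mkString("[", ", ", "]")) // prints [1.0, 0.0]
    println(encode(0, 3, Some(0)).mkString("[", ", ", "]")) // prints [0.0, 0.0]
    // Without dropping, every row sums to 1, so the columns are collinear
    // with an intercept term.
    println(encode(1, 3, None).mkString("[", ", ", "]"))    // prints [0.0, 1.0, 0.0]
  }
}
```

With a category dropped, the reference category is represented by the all-zeros vector, which is why removing it loses no information.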