Xiangrui Meng created SPARK-5888:
------------------------------------
Summary: Add OneHotEncoder
Key: SPARK-5888
URL: https://issues.apache.org/jira/browse/SPARK-5888
Project: Spark
Issue Type: Sub-task
Components: ML
Reporter: Xiangrui Meng
`OneHotEncoder` takes a categorical column and outputs a vector column that
encodes each category as binary (0/1) indicators.
{code}
val ohe = new OneHotEncoder()
  .setInputCol("countryIndex")
  .setOutputCol("countries")
{code}
It should read the category info from the metadata and assign feature names
properly in the output column. We need to discuss the default naming scheme and
whether we should let it process multiple categorical columns at the same time.
One category (the most frequent one) should be removed from the output to make
the output columns linearly independent. Or this could be an option turned on by
default.
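
To make the dropped-category behavior concrete, here is a minimal sketch in plain
Scala with no Spark dependency. It is not the proposed implementation: the
object `OneHotSketch`, the helpers `encode` and `mostFrequent`, and the
`dropped` parameter are hypothetical names used only for illustration. With all
categories kept, the indicator columns always sum to 1, so they are linearly
dependent once an intercept is present; dropping one category removes that
dependency, and the dropped category is represented by the all-zero vector.
{code}
object OneHotSketch {

  /** Encode a category index as a 0/1 vector (hypothetical helper).
    * If `dropped` is set, that category gets no slot in the output,
    * so it is encoded as all zeros and the vector has one fewer entry.
    */
  def encode(index: Int, numCategories: Int, dropped: Option[Int] = None): Array[Double] = {
    require(index >= 0 && index < numCategories, s"index $index out of range")
    dropped match {
      case None =>
        Array.tabulate(numCategories)(i => if (i == index) 1.0 else 0.0)
      case Some(d) =>
        require(d >= 0 && d < numCategories, s"dropped index $d out of range")
        // Keep every category except the dropped one, in index order.
        val kept = (0 until numCategories).filter(_ != d)
        kept.map(i => if (i == index) 1.0 else 0.0).toArray
    }
  }

  /** Most frequent category index in a column (hypothetical helper).
    * Tie-breaking is unspecified in this sketch.
    */
  def mostFrequent(column: Seq[Int]): Int =
    column.groupBy(identity).maxBy(_._2.size)._1
}

// Usage: category 1 is the most frequent, so it is the one dropped.
val countryIndex = Seq(0, 1, 1, 2, 1)
val drop = OneHotSketch.mostFrequent(countryIndex)
val encoded = countryIndex.map(i => OneHotSketch.encode(i, 3, Some(drop)))
{code}
In this sketch each encoded vector has two entries instead of three, and rows
with `countryIndex == 1` become `[0.0, 0.0]`.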