Xiangrui Meng created SPARK-5888:
------------------------------------

             Summary: Add OneHotEncoder
                 Key: SPARK-5888
                 URL: https://issues.apache.org/jira/browse/SPARK-5888
             Project: Spark
          Issue Type: Sub-task
          Components: ML
            Reporter: Xiangrui Meng


`OneHotEncoder` takes a categorical column and outputs a vector column that
stores the category information as binary indicators.

{code}
val ohe = new OneHotEncoder()
  .setInputCol("countryIndex")
  .setOutputCol("countries")
{code}

It should read the category info from the metadata and assign feature names 
properly in the output column. We need to discuss the default naming scheme and 
whether we should let it process multiple categorical columns at the same time.
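
To make the naming discussion concrete, here is a minimal sketch of one possible scheme, in plain Scala with no Spark dependency. The object `FeatureNaming`, the method `featureNames`, and the `<inputCol>_<category>` pattern are all hypothetical and are not an existing Spark API; the category labels are assumed to have already been read from the column metadata.

```scala
// Hypothetical helper, not part of Spark: derives one output feature name
// per category from the input column name and the category labels.
object FeatureNaming {
  // Assumed scheme: "<inputCol>_<category>", in the order the labels are given.
  def featureNames(inputCol: String, categories: Seq[String]): Seq[String] =
    categories.map(c => s"${inputCol}_$c")
}
```

With this scheme, `FeatureNaming.featureNames("countryIndex", Seq("US", "CN"))` would yield `countryIndex_US` and `countryIndex_CN`. Other schemes, such as using only the category label, are equally possible.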

One category (the most frequent one) should be removed from the output to make
the output columns linearly independent. Alternatively, this could be an option
that is turned on by default.
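
The effect of removing one category can be sketched in plain Scala without Spark. Without the removal, the indicator columns of every row sum to 1, so they are collinear with an intercept term. The names `OneHotSketch`, `mostFrequent`, and `encode` below are hypothetical, as is the choice to pass the dropped category as an `Option`; this is an illustration of the idea, not the proposed implementation.

```scala
// Hypothetical sketch, not part of Spark: one-hot encoding of category
// indices with an optional dropped category.
object OneHotSketch {
  // Returns the most frequent category index; ties go to the smallest index.
  def mostFrequent(indices: Seq[Int]): Int = {
    require(indices.nonEmpty, "need at least one category index")
    indices.groupBy(identity)
      .map { case (k, v) => (k, v.size) }
      .toSeq
      .sortBy { case (k, n) => (-n, k) }
      .head._1
  }

  // Encodes `index` as a 0/1 vector over the kept categories.
  // If `dropped` is defined, that category gets no slot in the output,
  // so a row belonging to it is encoded as all zeros.
  def encode(index: Int, numCategories: Int, dropped: Option[Int]): Array[Double] = {
    val kept = (0 until numCategories).filterNot(i => dropped.contains(i))
    kept.map(i => if (i == index) 1.0 else 0.0).toArray
  }
}
```

For the indices `Seq(0, 1, 1, 2)`, category 1 is the most frequent. Dropping it leaves two output slots, and rows in category 1 are encoded as the zero vector; passing `None` keeps all three slots.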



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
