[
https://issues.apache.org/jira/browse/SPARK-5886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Xiangrui Meng updated SPARK-5886:
---------------------------------
Description:
`StringIndexer` takes a column of string labels (raw categories) and outputs an
integer column with labels indexed by their frequency.
{code}
val li = new StringIndexer()
  .setInputCol("country")
  .setOutputCol("countryIndex")
{code}
In the output column, we should store the label-to-index map as an ML
attribute. Indices should be assigned in descending order of frequency, so
that the most frequent label gets index 0, to enhance sparsity.
We can discuss whether this should index multiple columns at the same time.
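For illustration only, the frequency ordering described above could be sketched in plain Scala, without Spark. The object `FrequencyIndexSketch` and the helper `buildIndex` are hypothetical names, not part of any Spark API, and breaking ties alphabetically is an assumption that the description does not specify.

```scala
object FrequencyIndexSketch {

  // Hypothetical helper (not a Spark API): maps each distinct label to an
  // integer index, where the most frequent label gets index 0.
  def buildIndex(labels: Seq[String]): Map[String, Int] = {
    // Count occurrences of each distinct label.
    val counts: Map[String, Int] =
      labels.groupBy(identity).map { case (label, group) => (label, group.size) }

    counts.toSeq
      // Sort by descending count; ties are broken alphabetically
      // (an assumption, chosen here only to make the result deterministic).
      .sortBy { case (label, count) => (-count, label) }
      .map { case (label, _) => label }
      .zipWithIndex
      .toMap
  }

  def main(args: Array[String]): Unit = {
    val countries = Seq("US", "CN", "US", "FR", "CN", "US")
    val index = buildIndex(countries)
    // Print the indexed value for each input row.
    countries.foreach(c => println(s"$c -> ${index(c)}"))
  }
}
```

In this sketch the returned map plays the role of the label-to-index map that the description proposes storing as an ML attribute on the output column.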
was:
`LabelIndexer` takes a column of labels (raw categories) and outputs an integer
column with labels indexed by their frequency.
{code}
val li = new LabelIndexer()
  .setInputCol("country")
  .setOutputCol("countryIndex")
{code}
In the output column, we should store the label to index map as an ML
attribute. The index should be ordered by frequency, where the most frequent
label gets index 0, to enhance sparsity.
We can discuss whether this should index multiple columns at the same time.
> Add StringIndexer
> -----------------
>
> Key: SPARK-5886
> URL: https://issues.apache.org/jira/browse/SPARK-5886
> Project: Spark
> Issue Type: Sub-task
> Components: ML
> Reporter: Xiangrui Meng
> Assignee: Xiangrui Meng
> Fix For: 1.4.0
>
>
> `StringIndexer` takes a column of string labels (raw categories) and outputs
> an integer column with labels indexed by their frequency.
> {code}
> val li = new StringIndexer()
>   .setInputCol("country")
>   .setOutputCol("countryIndex")
> {code}
> In the output column, we should store the label-to-index map as an ML
> attribute. Indices should be assigned in descending order of frequency, so
> that the most frequent label gets index 0, to enhance sparsity.
> We can discuss whether this should index multiple columns at the same time.