[ https://issues.apache.org/jira/browse/SPARK-5888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541367#comment-14541367 ]
Sandy Ryza commented on SPARK-5888:
-----------------------------------
Right, but while the values are unknown at first, they will become known at
some point during the execution (after StringIndexer.fit completes). So it
seems like at some point it would be good to pass these values down.
Put another way, it seems bad to me that the user should see different behavior
with regard to attribute values if they use OneHotEncoder in the Pipeline way,
as described by Herman above, vs in the slightly more verbose way where
StringIndexer.transform is explicitly called first:
https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/feature/OneHotEncoderSuite.scala#L50.
I'm relatively unfamiliar with these APIs, so apologies if I'm not making sense.
> Add OneHotEncoder as a Transformer
> ----------------------------------
>
> Key: SPARK-5888
> URL: https://issues.apache.org/jira/browse/SPARK-5888
> Project: Spark
> Issue Type: Sub-task
> Components: ML
> Reporter: Xiangrui Meng
> Assignee: Sandy Ryza
> Fix For: 1.4.0
>
>
> `OneHotEncoder` takes a categorical column and outputs a vector column that
> stores the category information as binary indicators.
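> {code}
> import org.apache.spark.ml.feature.OneHotEncoder
> // Encode the indexed category column as a vector column.
> val ohe = new OneHotEncoder()
>   .setInputCol("countryIndex")
>   .setOutputCol("countries")
> {code}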
> It should read the category info from the metadata and assign feature names
> properly in the output column. We need to discuss the default naming scheme
> and whether we should let it process multiple categorical columns at the same
> time.
> One category (the most frequent one) should be removed from the output to
> make the output columns linearly independent. Or this could be an option
> turned on by default.
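
To make the encoding described in the issue concrete, here is a minimal sketch in plain Scala with no Spark dependency. It is not the implementation proposed in SPARK-5888. The object name `OneHotSketch`, both helper functions, the sample data, and the `countries_<label>` naming scheme are all assumptions made for illustration (the issue leaves the default naming scheme open). It orders categories by frequency, drops the most frequent one, and names each remaining output column after its category.

```scala
// Hypothetical sketch in plain Scala; every name here is illustrative, not Spark API.
object OneHotSketch {
  // Order the distinct categories by descending frequency, ties broken alphabetically.
  def labelsByFrequency(values: Seq[String]): Vector[String] =
    values.groupBy(identity).toVector
      .sortBy { case (label, occurrences) => (-occurrences.size, label) }
      .map(_._1)

  // Encode one value against the kept labels.
  // A dropped (or unseen) category maps to the all-zeros vector.
  def encode(value: String, keptLabels: Vector[String]): Vector[Double] =
    keptLabels.map(label => if (label == value) 1.0 else 0.0)

  def main(args: Array[String]): Unit = {
    val countries = Seq("US", "FR", "US", "DE", "US", "FR") // made-up sample data
    val labels = labelsByFrequency(countries)

    // Drop the most frequent category so the output columns are linearly independent.
    val kept = labels.tail

    // Assumed naming scheme: <outputCol>_<categoryLabel>.
    val featureNames = kept.map(label => s"countries_$label")
    println(featureNames.mkString(", "))

    countries.foreach { c =>
      println(s"$c -> ${encode(c, kept).mkString("[", ", ", "]")}")
    }
  }
}
```

Without the dropped category the indicator columns would always sum to 1, which is the linear dependence the description wants to avoid. The sketch also shows why the category labels matter: the output column names can only be assigned once the labels are known.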