[
https://issues.apache.org/jira/browse/SPARK-20949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean Owen resolved SPARK-20949.
-------------------------------
Resolution: Invalid
Questions belong on the mailing list. Intuitively, you can see that the last
encoding column is knowable from the others, and so is redundant. The column
vectors (not rows) end up linearly dependent because of the intercept
term. See a good explanation at
http://www.algosome.com/articles/dummy-variable-trap-regression.html
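To make the column-wise dependence concrete, here is a minimal NumPy sketch (not Spark code; the data is made up for illustration). With an intercept column of ones, the full set of one-hot columns sums exactly to the intercept, so the design matrix loses a rank; dropping the last category restores full column rank.

```python
import numpy as np

# A 3-category feature, fully one-hot encoded (4 sample rows, invented data).
X_full = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
])
intercept = np.ones((4, 1))

# Full encoding plus intercept: the three dummy columns sum to the
# intercept column, so the 4-column matrix has rank 3 (rank-deficient).
A = np.hstack([intercept, X_full])
rank_full = np.linalg.matrix_rank(A)

# Drop the last category (as Spark's OneHotEncoder does): 3 columns, rank 3.
B = np.hstack([intercept, X_full[:, :2]])
rank_dropped = np.linalg.matrix_rank(B)

print(rank_full, A.shape[1])      # rank < number of columns
print(rank_dropped, B.shape[1])   # rank == number of columns
```

This is why the individual row vectors [1.0, 0.0] and [0.0, 1.0] being independent does not settle the question: the dependence is among the encoding *columns* once an intercept is present.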
> Is there another reason for the onehotencoder is different from scikit learn
> than specified in scaladoc?
> --------------------------------------------------------------------------------------------------------
>
> Key: SPARK-20949
> URL: https://issues.apache.org/jira/browse/SPARK-20949
> Project: Spark
> Issue Type: Question
> Components: ML
> Affects Versions: 1.6.2
> Reporter: Sungjun Kim
> Priority: Minor
>
> Spark's OneHotEncoder is different from scikit-learn's: it maps one category
> to the zero vector.
> The scaladoc gives a reason for this. It says that "it makes the vector
> entries sum up to one, and hence linearly dependent." But I don't think this
> is correct. Consider the vectors [1.0, 0.0] and [0.0, 1.0]. They each sum to 1
> but are obviously linearly independent. Am I missing something, or is there
> another reason?
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]