[ 
https://issues.apache.org/jira/browse/SPARK-20949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-20949.
-------------------------------
    Resolution: Invalid

Questions belong on the mailing list. Intuitively, you can see that the last 
encoding column is determined by the others, so it is redundant. The reason the 
column vectors (not the rows) end up linearly dependent is the intercept 
term: with a full one-hot encoding, the dummy columns sum to the all-ones 
intercept column. See a good explanation at 
http://www.algosome.com/articles/dummy-variable-trap-regression.html
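
The column (not row) dependence can be sketched with a tiny hand-rolled encoder. This is a hypothetical illustration, not Spark's OneHotEncoder API: with a *full* one-hot encoding plus an intercept column of ones, the dummy columns sum element-wise to the intercept column, so the design matrix is rank-deficient.

```python
# Hypothetical sketch of the dummy variable trap (not Spark code).
rows = ["a", "b", "c", "a", "b"]
cats = ["a", "b", "c"]

def encode(value):
    # Full one-hot: one column per category, no category dropped.
    return [1.0 if value == c else 0.0 for c in cats]

# Design matrix: intercept column of ones, then the dummy columns.
X = [[1.0] + encode(v) for v in rows]

# Each row's dummy entries sum to 1.0 == the intercept entry, i.e. the
# dummy columns add up to the intercept column: a linear dependence
# among columns, even though individual rows like [1,0,0] and [0,1,0]
# are linearly independent of each other.
for row in X:
    assert sum(row[1:]) == row[0]
```

Dropping one category (as Spark does, mapping the last category to the all-zeros vector) removes this dependence, which is the behavior the scaladoc is justifying.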

> Is there another reason for the onehotencoder is different from scikit learn 
> than specified in scaladoc?
> --------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-20949
>                 URL: https://issues.apache.org/jira/browse/SPARK-20949
>             Project: Spark
>          Issue Type: Question
>          Components: ML
>    Affects Versions: 1.6.2
>            Reporter: Sungjun Kim
>            Priority: Minor
>
> Spark's OneHotEncoder is different from scikit-learn's: it maps the last 
> category to the zero vector.
> The scaladoc gives a reason for this. It says that otherwise "it makes the 
> vector entries sum up to one, and hence linearly dependent." But I don't 
> think this is correct. Consider the vectors [1.0, 0.0] and [0.0, 1.0]. Their 
> entries each sum to 1, but they are obviously linearly independent. Am I 
> missing something? Or is there another reason?



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
