Github user sryza commented on the pull request:
https://github.com/apache/spark/pull/304#issuecomment-40303204
Thanks for taking a look @mengxr. I'm working on a patch that addresses the
inline comments. On the broader points:
> We should spend more time on the data types.
Agreed. It would probably make sense to have some way of accepting sparse
input — maybe just a Map[Int, T]?
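To make the idea concrete, here's a minimal sketch of what accepting sparse input as a Map[Int, T] could look like. The names (`SparseRow`, `toDenseFeatures`) are hypothetical and not from the PR; keys are feature indices and absent keys fall back to a default value.

```scala
// Hypothetical sketch: sparse categorical input as Map[Int, Double],
// where absent indices take a default value.
case class SparseRow(numFeatures: Int, values: Map[Int, Double])

def toDenseFeatures(row: SparseRow, default: Double = 0.0): Array[Double] = {
  val arr = Array.fill(row.numFeatures)(default)
  row.values.foreach { case (i, v) => arr(i) = v }
  arr
}

val row = SparseRow(5, Map(1 -> 2.0, 3 -> 1.0))
val dense = toDenseFeatures(row)
// dense: Array(0.0, 2.0, 0.0, 1.0, 0.0)
```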
> Using Array to store features would result in reallocation of memory.
Do you mind elaborating on this a little more? How can we avoid the
reallocation?
> The output of one-hot is always sparse, we should use sparse vector
instead of dense.
While one-hot encoding increases sparsity, a dense representation is still
more efficient in many cases. I'm not sure exactly where the boundary lies,
but in the extreme case, a long dense vector with only a few categorical
variables, each taking on only a few categories, will still be better served
by a dense representation after the transformation. In my opinion, we should
give the user control and default to outputting sparse vectors only when the
input type is sparse. What do you think?
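A back-of-envelope cost comparison illustrates the point. Assuming 8-byte doubles for dense storage and roughly 12 bytes per nonzero for a sparse index/value pair (4-byte int plus 8-byte double) — illustrative figures, not measurements from the PR:

```scala
// Approximate per-row storage cost in bytes, under the stated assumptions.
def denseBytes(length: Int): Int = 8 * length
def sparseBytes(nnz: Int): Int = 12 * nnz // 4-byte index + 8-byte value

// A mostly-dense 100-feature row with 2 categorical columns of 3
// categories each: one-hot grows the length from 100 to 104 but the
// nonzero count stays near 100, so dense still wins.
val lengthAfterOneHot = 100 - 2 + 2 * 3 // 104
val denseCost = denseBytes(lengthAfterOneHot) // 832 bytes
val sparseCost = sparseBytes(100)             // 1200 bytes
assert(denseCost < sparseCost)
```

The crossover only favors sparse when the nonzero fraction drops well below the index/value overhead ratio, which is why a sparse-in, sparse-out default seems reasonable.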