Github user sryza commented on the pull request:
https://github.com/apache/spark/pull/304#issuecomment-40303204
Thanks for taking a look @mengxr. I'm working on a patch that addresses the
inline comments. On the broader points:
> We should spend more time on the data types.
Agreed. It would probably make sense to have some way of accepting sparse
input — maybe just a Map[Int, T]?
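To make the idea concrete, here's a minimal sketch of what accepting sparse input as a Map[Int, T] could look like. The names (`SparseRow`, `toDenseFeatures`) are hypothetical and not from the PR; keys are feature indices and absent keys fall back to a default value.

```scala
// Hypothetical sketch: sparse categorical input as Map[Int, Double],
// where absent indices take a default value.
case class SparseRow(numFeatures: Int, values: Map[Int, Double])

def toDenseFeatures(row: SparseRow, default: Double = 0.0): Array[Double] = {
  val arr = Array.fill(row.numFeatures)(default)
  row.values.foreach { case (i, v) => arr(i) = v }
  arr
}

val row = SparseRow(5, Map(1 -> 2.0, 3 -> 1.0))
val dense = toDenseFeatures(row)
// dense: Array(0.0, 2.0, 0.0, 1.0, 0.0)
```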
> Using Array to store features would result in reallocation of memory.
Do you mind elaborating on this a little more? How can we avoid the
reallocation?
> The output of one-hot is always sparse, we should use sparse vector
instead of dense.
While one-hot encoding increases sparsity, a dense representation is still
more efficient in many cases. I'm not sure exactly where the boundary lies,
but in the extreme case, a long dense vector with only a few categorical
variables, each taking on only a few categories, will still be better served
by a dense representation after the transformation. In my opinion, we should
give the user control and default to outputting sparse vectors only when the
input type is sparse. What do you think?
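A back-of-envelope cost comparison illustrates the point. Assuming 8-byte doubles for dense storage and roughly 12 bytes per nonzero for a sparse index/value pair (4-byte int plus 8-byte double) — illustrative figures, not measurements from the PR:

```scala
// Approximate per-row storage cost in bytes, under the stated assumptions.
def denseBytes(length: Int): Int = 8 * length
def sparseBytes(nnz: Int): Int = 12 * nnz // 4-byte index + 8-byte value

// A mostly-dense 100-feature row with 2 categorical columns of 3
// categories each: one-hot grows the length from 100 to 104 but the
// nonzero count stays near 100, so dense still wins.
val lengthAfterOneHot = 100 - 2 + 2 * 3 // 104
val denseCost = denseBytes(lengthAfterOneHot) // 832 bytes
val sparseCost = sparseBytes(100)             // 1200 bytes
assert(denseCost < sparseCost)
```

The crossover only favors sparse when the nonzero fraction drops well below the index/value overhead ratio, which is why a sparse-in, sparse-out default seems reasonable.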