GitHub user MLnick commented on the issue:
https://github.com/apache/spark/pull/18513
@hhbyyh can you elaborate on your concerns in comment
https://github.com/apache/spark/pull/18513#pullrequestreview-50194532?
I tend to agree that the hasher is perhaps best used for categorical
features, while known real features could be "assembled" onto the resulting
hashed feature vector. However, one nice thing about hashing is that it can
handle everything at once, in one pass. In practice, even with very
high-cardinality categorical features and some real features, the hash
collision rate at "normal" settings of hash bits is relatively low and has
very little impact on performance (at least in my experiments). Of course,
this assumes highly sparse data; if the data is not sparse, it's usually best
to use other mechanisms.
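
To make the "everything in one pass" point concrete, here is a minimal, self-contained sketch of the hashing trick in plain Python. It is not the code from this PR: the helper names (`hash_index`, `hash_row`, `expected_collision_fraction`), the use of CRC32 as the hash function, and the choice of 18 hash bits are all assumptions made for illustration. It also assumes that a categorical value is hashed as `"column=value"` with weight 1.0, and that a real value is hashed by column name with the value itself as the weight.

```python
import zlib


def hash_index(key: str, num_features: int) -> int:
    # CRC32 is used only because it is deterministic and in the standard
    # library; it is an assumption, not the hash function used by the PR.
    return zlib.crc32(key.encode("utf-8")) % num_features


def hash_row(row: dict, num_bits: int = 18) -> dict:
    """Hash a mixed categorical/real row into a sparse {index: value} vector."""
    num_features = 1 << num_bits
    vec = {}
    for col, value in row.items():
        if isinstance(value, (str, bool)):
            # Categorical: one slot per (column, value) pair, indicator weight.
            key, weight = f"{col}={value}", 1.0
        else:
            # Real-valued: one slot per column, the value is the weight.
            key, weight = col, float(value)
        idx = hash_index(key, num_features)
        # Colliding features are summed into the same slot.
        vec[idx] = vec.get(idx, 0.0) + weight
    return vec


def expected_collision_fraction(n_distinct: int, num_bits: int) -> float:
    """Probability that a given feature shares its slot with at least one
    of the other n_distinct - 1 features, assuming a uniform hash."""
    m = 1 << num_bits
    return 1.0 - (1.0 - 1.0 / m) ** (n_distinct - 1)


row = {"country": "NZ", "device": "mobile", "clicks": 3.0}
print(hash_row(row))
print(expected_collision_fraction(10_000, 18))
```

Under the uniform-hash assumption, 10,000 distinct features hashed into 2^18 slots give each feature a bit under a 4% chance of sharing its slot, which is one way to quantify "relatively low" for sparse data.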