GitHub user MLnick commented on the issue:

    https://github.com/apache/spark/pull/18513
  
    @hhbyyh can you elaborate on your concerns in comment 
https://github.com/apache/spark/pull/18513#pullrequestreview-50194532?
    
    I tend to agree that the hasher is perhaps best used for categorical 
features, while known real features could be "assembled" onto the resulting 
hashed feature vector. However, one nice thing about hashing is that it can 
handle everything at once, in a single pass. In practice, even with very 
high-cardinality categorical features and some real features, the hash 
collision rate is relatively low for the "normal" settings of hash bits and 
has very little impact on performance (at least in my experiments). Of 
course, this assumes highly sparse data; if the data is not sparse, it's 
usually best to use other mechanisms.
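
    To make the "everything at once in one pass" point concrete, here is a 
minimal sketch of the hashing trick applied to a row with mixed categorical 
and real features. It is plain Python for illustration only, not the code in 
this PR: the function name `hash_features`, the `num_features` default, and 
the use of `zlib.crc32` as the hash function are all assumptions made for the 
example (Spark's own hashing transformers use a different hash).

```python
import zlib


def hash_features(row, num_features=1 << 18):
    """Hash a dict of mixed categorical/real features into a sparse vector.

    Hypothetical helper for illustration. Returns a dict mapping
    index -> value, standing in for a sparse vector of size num_features.
    """
    vec = {}
    for name, value in row.items():
        if isinstance(value, (str, bool)):
            # Categorical: hash "name=value" and use an indicator of 1.0.
            key, weight = f"{name}={value}", 1.0
        else:
            # Real-valued: hash the column name and keep the numeric value.
            key, weight = name, float(value)
        # crc32 is deterministic across runs; it is a stand-in hash here.
        idx = zlib.crc32(key.encode("utf-8")) % num_features
        # Colliding features are summed into the same slot.
        vec[idx] = vec.get(idx, 0.0) + weight
    return vec


row = {"country": "ZA", "device": "mobile", "clicks": 3.0}
vec = hash_features(row)
```

    Categorical and real columns go through the same loop, so no separate 
indexing or assembling step is needed. The trade-off is that a real feature 
can collide with a categorical one, in which case their values are summed in 
the same slot.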



