[GitHub] spark issue #18513: [SPARK-13969][ML] Add FeatureHasher transformer

MLnick Mon, 03 Jul 2017 03:11:01 -0700

Github user MLnick commented on the issue:

    https://github.com/apache/spark/pull/18513
  
    **Note 1**: this is distinct from `HashingTF` which handles vectorizing 
text to term frequencies (analogous to 
[HashingVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html)).
 This feature hasher _could_ be extended to also handle `Seq[String]` input 
columns. But I feel it conflates concerns - e.g. `HashingTF` handles min term 
frequencies, binarization etc. 
    
    However, we could later add basic support for `Seq[String]` columns - this 
would handle raw text in a similar way to Vowpal Wabbit, i.e. it all gets 
hashed into one feature vector (can be combined with namespaces later).
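
A minimal sketch of the "it all gets hashed into one feature vector" idea, in Python for brevity. The names `hash_tokens` and `num_features` are hypothetical, and `zlib.crc32` is only a deterministic stand-in for whatever hash function the transformer would actually use; this is not the PR's implementation.

```python
import zlib
from collections import defaultdict


def hash_tokens(tokens, num_features=16):
    """Hash every token of one row into a single sparse vector.

    Returns a dict mapping index -> accumulated count. Tokens that
    collide on the same index simply add up.
    """
    vec = defaultdict(float)
    for tok in tokens:
        # crc32 is an assumption made for illustration only.
        idx = zlib.crc32(tok.encode("utf-8")) % num_features
        vec[idx] += 1.0
    return dict(vec)


row = hash_tokens(["the", "quick", "the"])
```

No vocabulary is built and no term-frequency options are applied, which is the sense in which this differs from what `HashingTF` offers.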
    
    **Note 2**: some potential follow-ups:
    * support specifying categorical columns explicitly. This would allow 
forcing columns that are stored in numerical format to be treated as categorical. 
Strings would still be treated as categorical.
    * support using the sign of the hashed value as the sign of the feature value, and then 
support a `non_negative` param (see 
[scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html))
    * support feature namespaces and feature interactions similar to [Vowpal 
Wabbit](https://github.com/JohnLangford/vowpal_wabbit/wiki/Feature-interactions)
 (see [here](https://gist.github.com/luoq/b4c374b5cbabe3ae76ffacdac22750af) for 
an outline of the code used). This could provide an efficient and scalable form 
of `PolynomialExpansion`.
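
The three follow-ups above can be sketched roughly as follows, again in Python and under stated assumptions. `hash_features`, `interact`, `categorical_cols`, `signed`, and the choice of the top hash bit as the sign are all hypothetical; `zlib.crc32` stands in for the real hash; and `non_negative` is modeled as taking absolute values after accumulation, which is how I understand the scikit-learn parameter to behave. The interaction step hashes a concatenated key, which is a simplification and not necessarily how Vowpal Wabbit combines hashes.

```python
import zlib


def _hash(key):
    # Deterministic stand-in hash, chosen only for illustration.
    return zlib.crc32(key.encode("utf-8"))


def hash_features(row, num_features=32, categorical_cols=(),
                  signed=False, non_negative=False):
    """Hash one row (dict of column -> value) into a sparse vector.

    Strings, and any column listed in categorical_cols, are hashed as
    "col=value" with value 1.0. Other columns are hashed by column
    name and keep their numeric value.
    """
    vec = {}
    for col, value in row.items():
        if isinstance(value, str) or col in categorical_cols:
            key, x = f"{col}={value}", 1.0
        else:
            key, x = col, float(value)
        h = _hash(key)
        idx = h % num_features
        if signed and (h >> 31) & 1:
            # Assumed convention: top bit of the hash decides the sign.
            x = -x
        vec[idx] = vec.get(idx, 0.0) + x
    if non_negative:
        vec = {i: abs(v) for i, v in vec.items()}
    return vec


def interact(ns_a, ns_b, num_features=32):
    """Cross two namespaces (dicts of feature key -> value).

    Each pair is hashed to one index and contributes the product of
    the two values, so no explicit polynomial expansion is stored.
    """
    vec = {}
    for ka, xa in ns_a.items():
        for kb, xb in ns_b.items():
            idx = _hash(f"{ka}^{kb}") % num_features
            vec[idx] = vec.get(idx, 0.0) + xa * xb
    return vec


plain = hash_features({"city": "NYC", "age": 30.0})
forced = hash_features({"city": "NYC", "age": 30.0}, categorical_cols=("age",))
crossed = interact({"city=NYC": 1.0}, {"age": 30.0})
```

With `categorical_cols=("age",)` the numeric column contributes an indicator of 1.0 for `age=30.0` instead of the value 30.0, which is the behavior the first follow-up asks for.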
    
    cc @srowen @jkbradley @sethah @hhbyyh @yanboliang @BryanCutler @holdenk


