Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/18513
**Note 1**: this is distinct from `HashingTF`, which handles vectorizing
text to term frequencies (analogous to
[HashingVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html)).
The feature hasher _could_ be extended to also handle `Seq[String]` input
columns, but I feel that conflates concerns - e.g. `HashingTF` handles min term
frequencies, binarization etc.
However, we could later add basic support for `Seq[String]` columns - this
would handle raw text in a similar way to Vowpal Wabbit, i.e. it all gets
hashed into one feature vector (can be combined with namespaces later).
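
To make the "it all gets hashed into one feature vector" idea concrete, here is a minimal Python sketch, not the code in this PR. It assumes string and boolean values are hashed as `column=value` with weight 1.0, numeric values are hashed by column name and keep their value, and a sequence of strings contributes one `column=token` feature per token. `bucket` and `hash_row` are hypothetical names, and MD5 is only a deterministic stand-in for whatever hash function the real transformer uses.

```python
import hashlib


def bucket(feature: str, num_features: int) -> int:
    """Map a feature name to an index in [0, num_features)."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little") % num_features


def hash_row(row: dict, num_features: int = 16) -> dict:
    """Hash every column of one row into a single sparse vector (index -> value)."""
    vec = {}
    for col, value in row.items():
        if isinstance(value, (bool, str)):
            # Assumed categorical encoding: indicator feature "col=value".
            items = [(f"{col}={value}", 1.0)]
        elif isinstance(value, (int, float)):
            # Assumed numeric encoding: feature "col" carrying the raw value.
            items = [(col, float(value))]
        else:
            # Sequence of strings: one indicator feature per token.
            items = [(f"{col}={tok}", 1.0) for tok in value]
        for name, weight in items:
            idx = bucket(name, num_features)
            # Colliding features simply add up in the same slot.
            vec[idx] = vec.get(idx, 0.0) + weight
    return vec


row = {"city": "NYC", "price": 2.5, "text": ["a", "b", "a"]}
print(hash_row(row, num_features=8))
```

Note that repeated tokens accumulate (the token `a` above contributes 2.0), but there is no minimum term frequency or binarization, which is the part that stays with `HashingTF`.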
**Note 2**: some potential follow-ups:
* support specifying categorical columns explicitly. This would be to allow
forcing some columns that are in numerical format to be treated as categorical.
Strings would still be treated as categorical.
* support using the sign of the hashed value as the sign of the feature value,
and then support a `non_negative` param (see
[scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html))
* support feature namespaces and feature interactions similar to [Vowpal
Wabbit](https://github.com/JohnLangford/vowpal_wabbit/wiki/Feature-interactions)
(see [here](https://gist.github.com/luoq/b4c374b5cbabe3ae76ffacdac22750af) for
an outline of the code used). This could provide an efficient and scalable form
of `PolynomialExpansion`.
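
A rough Python sketch of two of the follow-ups above (signed hashing and namespace interactions), under stated assumptions and not taken from this PR or the linked gist. `signed_hash` and `interact` are hypothetical helpers: the sign comes from the low bit of the hash, `non_negative` takes the absolute value of the output after summation (loosely mirroring the scikit-learn option), and an interaction is the product of each pair of features from two namespaces. MD5 again stands in for the real hash function.

```python
import hashlib
from itertools import product


def _hash32(name: str) -> int:
    """Deterministic 32-bit hash of a feature name (MD5 is a stand-in)."""
    return int.from_bytes(hashlib.md5(name.encode("utf-8")).digest()[:4], "little")


def signed_hash(features, num_features=16, alternate_sign=True, non_negative=False):
    """Hash named features into a dense vector, optionally flipping signs."""
    vec = [0.0] * num_features
    for name, value in features.items():
        h = _hash32(name)
        idx = (h >> 1) % num_features
        # Low bit of the hash decides the sign, so collisions tend to cancel.
        sign = -1.0 if (alternate_sign and (h & 1)) else 1.0
        vec[idx] += sign * value
    if non_negative:
        vec = [abs(v) for v in vec]
    return vec


def interact(ns_a, ns_b):
    """Cross two namespaces: each feature pair yields one product feature."""
    return {
        f"{a}*{b}": va * vb
        for (a, va), (b, vb) in product(ns_a.items(), ns_b.items())
    }


user = {"age": 30.0, "country=ZA": 1.0}
item = {"price": 2.0}
crossed = interact(user, item)          # {"age*price": 60.0, "country=ZA*price": 2.0}
print(signed_hash({**user, **item, **crossed}, num_features=8))
```

Because the crossed features are hashed into the same fixed-size vector, the output dimension does not grow with the number of interaction terms, which is what would make this a cheaper alternative to an explicit polynomial expansion.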
cc @srowen @jkbradley @sethah @hhbyyh @yanboliang @BryanCutler @holdenk