GitHub user MLnick opened a pull request:
https://github.com/apache/spark/pull/18513
[SPARK-13969][ML] Add FeatureHasher transformer
This PR adds a `FeatureHasher` transformer, modeled on
[scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html)
and [Vowpal
Wabbit](https://github.com/JohnLangford/vowpal_wabbit/wiki/Feature-Hashing-and-Extraction).
The transformer operates on multiple input columns in a single pass. Current
behavior is:
* For numeric columns, the values are treated as real numbers: the feature
index is `hash(column_name)` and the feature value is the column's value.
* For string columns, the values are treated as categorical: the feature
index is `hash(column_name=feature_value)` and the feature value is `1.0`.
* When features collide on the same hash index, their values are summed.
* `null` (missing) values are ignored.
The following DataFrame illustrates the basic semantics:
```
+---+------+-----+---------+------+-----------------------------------------+
|int|double|float|stringNum|string|features                                 |
+---+------+-----+---------+------+-----------------------------------------+
|3  |4.0   |5.0  |1        |foo   |(16,[0,8,11,12,15],[5.0,3.0,1.0,4.0,1.0])|
|6  |7.0   |8.0  |2        |bar   |(16,[0,8,11,12,15],[8.0,6.0,1.0,7.0,1.0])|
+---+------+-----+---------+------+-----------------------------------------+
```
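For reviewers who want to reason about these rules outside Spark, they can be sketched in a few lines of plain Python. This is only an illustration, not the code in this PR: `hash_index` and `hash_features` are hypothetical names, and `zlib.crc32` is a stand-in hash chosen so the sketch is deterministic. The indices it produces will therefore not match the `features` column shown above.

```python
import zlib
from typing import Dict, Mapping, Optional, Union

Value = Optional[Union[int, float, str]]


def hash_index(key: str, num_features: int) -> int:
    """Map a feature key to a bucket in [0, num_features).

    zlib.crc32 is an assumption made for this sketch; it is not necessarily
    the hash function used by the transformer in this PR.
    """
    return zlib.crc32(key.encode("utf-8")) % num_features


def hash_features(row: Mapping[str, Value], num_features: int = 16) -> Dict[int, float]:
    """Hash one row into a sparse {index: value} mapping."""
    features: Dict[int, float] = {}
    for column, value in row.items():
        if value is None:
            continue  # null (missing) values are ignored
        if isinstance(value, str):
            # Categorical: index from "column=value", indicator value 1.0
            key, weight = f"{column}={value}", 1.0
        else:
            # Numeric: index from the column name, value used as-is
            key, weight = column, float(value)
        idx = hash_index(key, num_features)
        # Features that land in the same bucket are summed
        features[idx] = features.get(idx, 0.0) + weight
    return features


# First row of the table above; stringNum is a string column, so it is categorical.
row = {"int": 3, "double": 4.0, "float": 5.0, "stringNum": "1", "string": "foo"}
print(sorted(hash_features(row).items()))
```

Because collisions are summed rather than overwritten, the total of the hashed values always equals the sum of the numeric inputs plus one per non-null string column, whatever the number of buckets.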
## How was this patch tested?
New unit tests and manual experiments.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/MLnick/spark FeatureHasher
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/18513.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #18513
----
commit 6ab19a963f35de29af0a6b7b1598d5add78f200a
Author: Nick Pentreath <[email protected]>
Date: 2016-08-23T10:29:06Z
initial WIP
commit ebd2cbf3467f26121c602f7c77c2018253cbdf18
Author: Nick Pentreath <[email protected]>
Date: 2017-02-01T10:43:07Z
Further work
commit ba255bfda792d58aaded892e49c6cf48f0391159
Author: Nick Pentreath <[email protected]>
Date: 2017-06-22T10:52:12Z
Clean up
commit 0be1e6572110d7d550f69fd86d3dd4e96660fde6
Author: Nick Pentreath <[email protected]>
Date: 2017-06-22T10:52:37Z
Add tests
commit 2f3ea21e2e1835d7218e8c7bd096cc0787ed595c
Author: Nick Pentreath <[email protected]>
Date: 2017-06-22T13:08:26Z
Copy, save/load, clean up
commit 7d678fbf5f88d377b79153212a3e0a2596039b17
Author: Nick Pentreath <[email protected]>
Date: 2017-06-26T12:38:02Z
Move numFeatures to HasNumFeatures shared trait
commit 60572776de80ebcf1782c3d7def749557c8bec61
Author: Nick Pentreath <[email protected]>
Date: 2017-07-03T07:18:25Z
Update shared params from codegen run
commit 9edb3bda8cbc4e00f05b91718249edf2750fc028
Author: Nick Pentreath <[email protected]>
Date: 2017-07-03T09:32:32Z
Update tests. Null values ignored in feature hashing.
----