GitHub user MLnick opened a pull request:
https://github.com/apache/spark/pull/19991
[SPARK-22801][ML][PYSPARK] Allow FeatureHasher to treat numeric columns as
categorical
Previously, `FeatureHasher` always treated numeric columns as numbers
and never as categorical features. It is quite common for data sources to
represent categorical features as numbers or codes.
To hash these features as categorical, users first had to explicitly
convert them to strings, which is cumbersome.
Add a new param `categoricalCols` which specifies the numeric columns that
should be treated as categorical features.
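
To illustrate the difference, here is a minimal pure-Python sketch of the hashing trick, not the actual Spark implementation. It assumes the behaviour described in the `FeatureHasher` docs: a numeric column is hashed by its column name and contributes its value, while a categorical feature is hashed as the string `column=value` and contributes 1.0. The names `bucket` and `hash_row` are hypothetical, and CRC32 stands in for the MurmurHash3 that Spark uses, so the indices will not match Spark's output.

```python
import zlib


def bucket(term, num_features):
    """Map a term to a vector index (CRC32 is a stand-in hash, an assumption)."""
    return zlib.crc32(term.encode("utf-8")) % num_features


def hash_row(row, num_features=1 << 18, categorical_cols=()):
    """Hash one row (a dict of column -> value) into a sparse vector (dict)."""
    vec = {}
    for col, value in row.items():
        is_numeric = isinstance(value, (int, float)) and not isinstance(value, bool)
        if is_numeric and col not in categorical_cols:
            # Numeric feature: index from the column name, weight is the value.
            idx, weight = bucket(col, num_features), float(value)
        else:
            # Categorical feature: index from "column=value", weight is 1.0.
            idx, weight = bucket(f"{col}={value}", num_features), 1.0
        # Colliding indices are summed, as in the usual hashing trick.
        vec[idx] = vec.get(idx, 0.0) + weight
    return vec


row = {"real": 2.2, "code": 3}
as_numeric = hash_row(row)                                # "code" contributes 3.0
as_categorical = hash_row(row, categorical_cols={"code"}) # "code=3" contributes 1.0
```

In PySpark the equivalent would presumably be to pass `categoricalCols=["code"]` when constructing `FeatureHasher`, assuming the param is exposed on the constructor like the other params.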
## How was this patch tested?
New unit tests.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/MLnick/spark hasher-num-cat
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19991.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19991
----
commit 5a57965a89ae16feae6adf413925a2c9de995ba1
Author: Nick Pentreath <[email protected]>
Date: 2017-12-15T12:43:16Z
Add categoricalCols param
----