GitHub user MLnick opened a pull request:

    https://github.com/apache/spark/pull/19991

    [SPARK-22801][ML][PYSPARK] Allow FeatureHasher to treat numeric columns as 
categorical

    Currently, `FeatureHasher` always treats numeric type columns as numbers 
and never as categorical features. However, it is quite common to have 
categorical features represented as numbers or codes in data sources.
    
    In order to hash these features as categorical, users must first explicitly 
convert them to strings, which is cumbersome.
    
    Add a new param `categoricalCols` which specifies the numeric columns that 
should be treated as categorical features.
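
The difference between the two treatments can be sketched in plain Python. This is an illustrative sketch only, not Spark's implementation: the function name `hash_features`, its parameters, and the use of MD5 as a stand-in hash are all assumptions made for the example. It follows the documented hashing-trick convention where a numeric column hashes the column name and keeps the value, while a categorical column hashes `column=value` and contributes an indicator of 1.0.

```python
import hashlib


def hash_features(row, num_features=16, categorical_cols=()):
    """Hash a {column: value} dict into a sparse {index: value} vector.

    Hypothetical helper for illustration; MD5 is only a stand-in hash.
    """
    vec = {}
    for col, value in row.items():
        is_numeric = isinstance(value, (int, float)) and not isinstance(value, bool)
        if is_numeric and col not in categorical_cols:
            # Numeric treatment: the column name picks the index,
            # and the raw number is the feature value.
            key, weight = col, float(value)
        else:
            # Categorical treatment: "column=value" picks the index,
            # and the feature value is an indicator of 1.0.
            key, weight = f"{col}={value}", 1.0
        digest = hashlib.md5(key.encode("utf-8")).digest()
        idx = int.from_bytes(digest[:4], "big") % num_features
        vec[idx] = vec.get(idx, 0.0) + weight
    return vec


row = {"code": 3}
as_numeric = hash_features(row)
as_categorical = hash_features(row, categorical_cols=("code",))
as_string = hash_features({"code": "3"})
```

In this sketch, listing `code` in `categorical_cols` gives the same result as converting the value to a string first, which is the manual step the new param is meant to avoid.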
    
    ## How was this patch tested?
    
    New unit tests.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MLnick/spark hasher-num-cat

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19991.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19991
    
----
commit 5a57965a89ae16feae6adf413925a2c9de995ba1
Author: Nick Pentreath <[email protected]>
Date:   2017-12-15T12:43:16Z

    Add categoricalCols param

----


