GitHub user mengxr opened a pull request:

    https://github.com/apache/spark/pull/6466

    [SPARK-7912] [MLLIB] Update OneHotEncoder to handle ML attributes and 
change includeFirst to dropLast

    This PR contains two major changes to `OneHotEncoder`:
    
    1. More robust handling of ML attributes. If the input attribute is 
unknown, we scan the values to find the max category index.
    2. Change `includeFirst` to `dropLast` and keep the default as `true`. 
There are a couple of benefits:
      a. It is consistent with other tutorials on one-hot encoding (or dummy coding) 
(e.g., http://www.ats.ucla.edu/stat/mult_pkg/faq/general/dummy.htm).
      b. It keeps the indices unmodified in the output vector. If we dropped the 
first, all indices would be shifted by 1.
      c. If users use `StringIndexer`, the last element is the least frequent one.
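
The two changes above can be illustrated with a small, Spark-free sketch. It is not the code in this PR: the function names `infer_num_categories`, `one_hot`, and `one_hot_drop_first` are hypothetical and exist only for illustration, and dense Python lists stand in for the output vectors. The sketch assumes category indices are non-negative integers starting at 0.

```python
from typing import List, Sequence


def infer_num_categories(indices: Sequence[int]) -> int:
    """Fallback for an unknown attribute: scan the values and use max index + 1."""
    return max(indices) + 1


def one_hot(index: int, num_categories: int, drop_last: bool = True) -> List[float]:
    """Encode a category index; with drop_last the last category maps to all zeros."""
    if not 0 <= index < num_categories:
        raise ValueError(f"index {index} out of range for {num_categories} categories")
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if index < size:
        # Position equals the category index, so indices are not shifted.
        vec[index] = 1.0
    return vec


def one_hot_drop_first(index: int, num_categories: int) -> List[float]:
    """For comparison only: dropping the first category shifts every index by 1."""
    if not 0 <= index < num_categories:
        raise ValueError(f"index {index} out of range for {num_categories} categories")
    vec = [0.0] * (num_categories - 1)
    if index > 0:
        vec[index - 1] = 1.0
    return vec


# Hypothetical column of already-indexed categories with no attribute metadata.
column = [0, 2, 1, 0]
n = infer_num_categories(column)
encoded = [one_hot(i, n) for i in column]
```

With three categories, `one_hot(1, 3)` places the 1.0 at position 1, whereas `one_hot_drop_first(1, 3)` places it at position 0, which is the index shift that benefit (b) refers to.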
    
    I'll update the user guide in another PR.
    
    @jkbradley @sryza 

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mengxr/spark SPARK-7912

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/6466.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #6466
    
----
commit 208ddad429191cf454b103faef9fa01b67ce3e89
Author: Xiangrui Meng <[email protected]>
Date:   2015-05-28T17:53:15Z

    update OneHotEncoder to handle ML attributes and change includeFirst to 
dropLast

commit d5ac64bcb806eeda836dd26aaca06c065a1a5a5b
Author: Xiangrui Meng <[email protected]>
Date:   2015-05-28T18:53:33Z

    update OneHotEncoder in Python

----

