GitHub user actuaryzhang opened a pull request:

    https://github.com/apache/spark/pull/17967

    [SPARK-14659] Allow RFormula to drop the same category as R when handling
strings

    ## What changes were proposed in this pull request?
    When handling strings, RFormula and R drop different categories:
    - RFormula drops the least frequent level
    - R drops the last level after ascending alphabetical ordering
    
    This PR supports the different string ordering types in StringIndexer
(#17879), so that RFormula can drop the same level as R when handling strings
when `stringOrderType = "alphabetAsc"`.
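
The difference between the two orderings can be illustrated without Spark. The sketch below is plain Python, not the RFormula or StringIndexer implementation: the helper names `order_levels` and `dropped_level` are hypothetical, the ordering name `"frequencyDesc"` and the alphabetical tie-break are assumptions, and it assumes the encoder drops whichever level comes last in index order, as the description above implies.

```python
from collections import Counter


def order_levels(values, order_type):
    """Return the distinct string levels in index order (hypothetical helper)."""
    counts = Counter(values)
    if order_type == "frequencyDesc":
        # Most frequent level first; ties broken alphabetically (an assumption).
        return sorted(counts, key=lambda level: (-counts[level], level))
    if order_type == "alphabetAsc":
        # Levels sorted alphabetically, ascending.
        return sorted(counts)
    raise ValueError("unsupported order type: " + order_type)


def dropped_level(values, order_type):
    """Return the level dropped when the last-indexed level is left out."""
    return order_levels(values, order_type)[-1]


# Level counts: "a" appears 2 times, "b" 3 times, "c" 4 times.
data = ["b", "b", "b", "a", "a", "c", "c", "c", "c"]

print(dropped_level(data, "frequencyDesc"))  # least frequent level: a
print(dropped_level(data, "alphabetAsc"))    # last level alphabetically: c
```

With frequency ordering the least frequent level is dropped, while with ascending alphabetical ordering the alphabetically last level is dropped, so the two settings can produce different reference categories for the same column.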
    
    ## How was this patch tested?
    New tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/actuaryzhang/spark RFormula

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17967.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17967
    
----
commit 4d27123926ee87231a73aea9dc34555c404c7f1b
Author: Wayne Zhang <[email protected]>
Date:   2017-05-12T07:52:13Z

    add stringOrderType to RFormula

commit 6841c33768adf1b1397dc5aa36e34abdb8d6ff8a
Author: Wayne Zhang <[email protected]>
Date:   2017-05-12T16:30:12Z

    clean up import

commit 77fe864770420719d396715479fc1f452a80b8da
Author: Wayne Zhang <[email protected]>
Date:   2017-05-12T17:48:44Z

    add comparison to R

----


