GitHub user actuaryzhang opened a pull request:
https://github.com/apache/spark/pull/17967
[SPARK-14659] RFormula allows to drop the same category as R when handling
strings
## What changes were proposed in this pull request?
When handling strings, RFormula and R drop different categories:
- RFormula drops the least frequent level
- R drops the last level after ascending alphabetical ordering
This PR supports the different string ordering types in StringIndexer (#17879),
so that RFormula drops the same level as R when handling strings with
`stringOrderType = "alphabetAsc"`.
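The difference between the two behaviors can be illustrated with a small, Spark-free sketch. It is not the PR's implementation: the helper names `order_levels` and `dropped_level`, the label `"frequencyDesc"` for the existing behavior, the alphabetical tie-breaking, and the assumption that the dropped category is the one holding the last index are all illustrative.

```python
from collections import Counter


def order_levels(values, order_type="frequencyDesc"):
    """Return the distinct levels in indexing order (index 0 first).

    Hypothetical helper for illustration only; not a Spark API.
    """
    counts = Counter(values)
    if order_type == "frequencyDesc":
        # Most frequent level first; ties broken alphabetically (an assumption).
        return sorted(counts, key=lambda level: (-counts[level], level))
    if order_type == "alphabetAsc":
        # Plain ascending alphabetical order.
        return sorted(counts)
    raise ValueError(f"unsupported order type: {order_type}")


def dropped_level(values, order_type):
    """Return the level left out of the encoding, assumed to be the last index."""
    return order_levels(values, order_type)[-1]


column = ["b", "a", "b", "c", "c", "c"]

# Frequency ordering puts the least frequent level last, so it is dropped.
print(dropped_level(column, "frequencyDesc"))  # a

# Ascending alphabetical ordering puts the alphabetically last level last.
print(dropped_level(column, "alphabetAsc"))  # c
```

With frequency ordering the dropped level depends on the data, while with alphabetical ordering it depends only on the set of level names, which is what makes the result comparable across tools.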
## How was this patch tested?
New tests.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/actuaryzhang/spark RFormula
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/17967.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #17967
----
commit 4d27123926ee87231a73aea9dc34555c404c7f1b
Author: Wayne Zhang <[email protected]>
Date: 2017-05-12T07:52:13Z
add stringOrderType to RFormula
commit 6841c33768adf1b1397dc5aa36e34abdb8d6ff8a
Author: Wayne Zhang <[email protected]>
Date: 2017-05-12T16:30:12Z
clean up import
commit 77fe864770420719d396715479fc1f452a80b8da
Author: Wayne Zhang <[email protected]>
Date: 2017-05-12T17:48:44Z
add comparison to R
----