GitHub user hhbyyh opened a pull request:
https://github.com/apache/spark/pull/10803
[SPARK-12875] [ML] Add Weight of Evidence and Information value to Spark.ml
as a feature transformer
jira: https://issues.apache.org/jira/browse/SPARK-12875
As a feature transformer, WOE and IV enable one to:
Consider each variable's independent contribution to the outcome.
Detect linear and non-linear relationships.
Rank variables in terms of "univariate" predictive strength.
Visualize the correlations between the predictive variables and the binary
outcome.
http://multithreaded.stitchfix.com/blog/2015/08/13/weight-of-evidence/
gives a good introduction to WoE and IV.
The Weight of Evidence (WoE) value measures how well a grouping of a
feature distinguishes between the two classes of a binary response (e.g.
"good" versus "bad"). It is widely used for grouping continuous features or
mapping categorical features to continuous values. It is computed from the
basic odds ratio:
(Distribution of positive Outcomes) / (Distribution of negative Outcomes)
where Distribution refers to the proportion of positive or negative
outcomes in the respective group, relative to the column totals.
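A minimal sketch of the ratio above for a categorical feature, in plain
Python rather than Spark. It assumes the conventional definition in which
WoE is the natural log of the ratio (the description above only states the
ratio itself), and it assumes every category contains at least one positive
and one negative outcome. The function name weight_of_evidence is
hypothetical and is not the API added by this pull request.

```python
import math
from collections import Counter


def weight_of_evidence(categories, labels):
    """Return {category: WoE} for a categorical feature and 0/1 labels.

    Illustrative only: WoE is taken as ln(dist_pos / dist_neg), the
    conventional definition, which is an assumption here.
    """
    pos = Counter(c for c, y in zip(categories, labels) if y == 1)
    neg = Counter(c for c, y in zip(categories, labels) if y == 0)
    total_pos = sum(pos.values())
    total_neg = sum(neg.values())

    woe = {}
    for c in set(categories):
        if pos[c] == 0 or neg[c] == 0:
            # The log of zero or a division by zero is undefined; real
            # implementations usually smooth the counts instead of failing.
            raise ValueError(f"category {c!r} lacks a positive or negative")
        dist_pos = pos[c] / total_pos  # share of all positives in group c
        dist_neg = neg[c] / total_neg  # share of all negatives in group c
        woe[c] = math.log(dist_pos / dist_neg)
    return woe


cats = ["a", "a", "a", "b", "b", "b"]
labels = [1, 1, 0, 1, 0, 0]
woe = weight_of_evidence(cats, labels)
print(woe)  # "a" maps to ln(2), "b" maps to -ln(2)
```

A positive value means the group holds a larger share of the positives than
of the negatives; a negative value means the opposite.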
The WoE recoding of features is particularly well suited for subsequent
modeling using Logistic Regression or MLP.
In addition, the Information Value (IV) can be computed from WoE; it is a
popular measure for selecting variables in a predictive model.
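As an illustration of how IV follows from WoE, the sketch below uses the
conventional definition: the sum over groups of
(dist_pos - dist_neg) * WoE, with WoE taken as ln(dist_pos / dist_neg).
Both the formula and the function name information_value are assumptions
made for this example; the code in the pull request may differ.

```python
import math
from collections import Counter


def information_value(categories, labels):
    """Return the IV of a categorical feature against 0/1 labels.

    Illustrative only: assumes IV = sum((dist_pos - dist_neg) * WoE) and
    that every category has at least one positive and one negative.
    """
    pos = Counter(c for c, y in zip(categories, labels) if y == 1)
    neg = Counter(c for c, y in zip(categories, labels) if y == 0)
    total_pos = sum(pos.values())
    total_neg = sum(neg.values())

    iv = 0.0
    for c in set(categories):
        if pos[c] == 0 or neg[c] == 0:
            raise ValueError(f"category {c!r} lacks a positive or negative")
        dist_pos = pos[c] / total_pos
        dist_neg = neg[c] / total_neg
        group_woe = math.log(dist_pos / dist_neg)
        # Each term is non-negative because the difference and the log
        # always share the same sign.
        iv += (dist_pos - dist_neg) * group_woe
    return iv


cats = ["a", "a", "a", "b", "b", "b"]
labels = [1, 1, 0, 1, 0, 0]
iv = information_value(cats, labels)
print(iv)  # (2/3) * ln(2), roughly 0.462
```

Under this definition a feature whose groups all have the same positive and
negative shares gets an IV of zero, so larger values indicate stronger
univariate separation.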
Next: Currently only calculation for categorical features is supported. A
follow-up is to add an estimator that finds the proper grouping for
continuous features.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/hhbyyh/spark woe
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/10803.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #10803
----
commit 0b360c4f54ee23efd5c29785e77d75217b5a0893
Author: Yuhao Yang <[email protected]>
Date: 2016-01-14T09:43:52Z
draft for woe
commit a674bb0190a07c9af1f210ae7acba89d1188be57
Author: Yuhao Yang <[email protected]>
Date: 2016-01-14T15:49:05Z
add iv
commit c2beb8b51a9a80f94da9de59f56988647050addf
Author: Yuhao Yang <[email protected]>
Date: 2016-01-16T08:36:05Z
Merge remote-tracking branch 'upstream/master' into woe
commit c6239383914a4c8bde2c4afb22398399803e55b0
Author: Yuhao Yang <[email protected]>
Date: 2016-01-17T06:38:51Z
woe and ut
commit ab3a961311672d70360fd4a322c42c92945b6ca6
Author: Yuhao Yang <[email protected]>
Date: 2016-01-17T06:38:58Z
Merge remote-tracking branch 'upstream/master' into woe
commit 11f3f5a12659b0b5028f37e1542d33130ba1459e
Author: Yuhao Yang <[email protected]>
Date: 2016-01-17T16:27:31Z
add require
commit f1f118b73950415e7326e744b1b17112942976fb
Author: Yuhao Yang <[email protected]>
Date: 2016-01-18T07:02:03Z
Merge remote-tracking branch 'upstream/master' into woe
commit 8bb38abe79e03490e79cfe31b86607d93818cb27
Author: Yuhao Yang <[email protected]>
Date: 2016-01-18T09:18:27Z
style fix
----