GitHub user vlad17 opened a pull request:
https://github.com/apache/spark/pull/14547
[SPARK-16718][MLlib] gbm-style treeboost [WIP]
## What changes were proposed in this pull request?
This change adds TreeBoost functionality to `GBTClassifier` and
`GBTRegressor`. The main change is that leaf nodes now make a prediction which
optimizes the loss function, rather than always using the mean label (which is
only optimal in the case of variance-based impurity).
This also changes the default impurity to the loss-based impurity; previously,
variance was the required impurity.
I made this change only for L2 loss and logistic loss (adding some aliases
to the names as well for parity with R's implementation, GBM). These two
functions have leaf predictions that can be computed within the framework of
the current impurity API. Other loss functions will require API modification,
which should be its own change, SPARK-16728.
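As a rough illustration of what "a prediction which optimizes the loss function" means for the two supported losses, here is a minimal Python sketch. It is not the code in this patch. It assumes 0/1 labels for logistic loss and uses a single Newton step for the leaf value (the same form as gbm's Bernoulli terminal-node estimate); the function names are hypothetical.

```python
import math


def l2_leaf_value(labels, predictions):
    """Loss-optimal leaf value for L2 loss: the mean residual.

    For L2 loss the minimizer of sum((y - (f + gamma))^2) over gamma is
    exactly mean(y - f), so this matches the mean-label leaf fitted on
    residuals.
    """
    residuals = [y - f for y, f in zip(labels, predictions)]
    return sum(residuals) / len(residuals)


def bernoulli_leaf_value(labels, predictions):
    """Approximate loss-optimal leaf value for logistic loss.

    Assumes labels in {0, 1} and predictions on the log-odds scale.
    One Newton step from gamma = 0: sum(y - p) / sum(p * (1 - p)).
    """
    numerator = 0.0
    denominator = 0.0
    for y, f in zip(labels, predictions):
        p = 1.0 / (1.0 + math.exp(-f))  # current predicted probability
        numerator += y - p              # negative gradient of the loss
        denominator += p * (1.0 - p)    # second derivative of the loss
    # Guard against a degenerate leaf where every p is exactly 0 or 1.
    return numerator / denominator if denominator > 0.0 else 0.0


if __name__ == "__main__":
    labels = [1, 1, 0, 1]
    predictions = [0.0, 0.0, 0.0, 0.0]
    print(l2_leaf_value(labels, predictions))         # mean residual
    print(bernoulli_leaf_value(labels, predictions))  # Newton-step value
```

For L2 loss the two approaches coincide, which is why the mean-label leaf is only optimal under variance-based impurity; for logistic loss the Newton-step value generally differs from the mean residual.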
Note that because loss-based impurity with L1 loss is NOT supported, code
that sets L1 loss but leaves impurity at its default will now throw; such code
must set the impurity to variance explicitly.
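The behavior change for L1 loss can be pictured with the following Python sketch of the parameter check. This is an assumption-laden illustration, not the validation code in this patch: the function name and the string values `"loss-based"`, `"variance"`, `"squared"`, `"logistic"`, and `"absolute"` are placeholders chosen for the example.

```python
# Hypothetical names throughout; the real parameter values may differ.
LOSS_BASED = "loss-based"
VARIANCE = "variance"

# Losses whose optimal leaf prediction fits the current impurity API
# (per the description: L2 and logistic loss only).
LOSSES_WITH_LOSS_BASED_IMPURITY = {"squared", "logistic"}


def resolve_impurity(loss, impurity=None):
    """Return the impurity to use, or raise if the combination is unsupported.

    `impurity=None` models a caller who never set the parameter, so the
    new default (loss-based impurity) applies.
    """
    chosen = LOSS_BASED if impurity is None else impurity
    if chosen == LOSS_BASED and loss not in LOSSES_WITH_LOSS_BASED_IMPURITY:
        raise ValueError(
            f"loss-based impurity is not supported for loss '{loss}'; "
            f"set impurity to '{VARIANCE}' explicitly"
        )
    return chosen


if __name__ == "__main__":
    print(resolve_impurity("squared"))             # default applies
    print(resolve_impurity("absolute", VARIANCE))  # explicit variance is fine
    try:
        resolve_impurity("absolute")               # default + L1 loss throws
    except ValueError as err:
        print(err)
```

Under these assumptions, existing callers that use L1 loss need a one-line change (setting variance impurity explicitly) to keep working.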
## How was this patch tested?
Unit testing for correctness: I tested the default parameter values and new
settings for the parameters.
[WIP] For accuracy, I'm currently comparing the performance on a [real-life
dataset](https://www.datarobot.com/blog/r-getting-started-with-data-science/)
between Spark and GBM. I will upload the results once I have them.
[WIP] This code shouldn't introduce any regressions, but it would be nice
to make sure. I'm waiting for @sethah to respond on [his previous
PR](https://github.com/apache/spark/commit/dafd70fbfe70702502ef198f2a8f529ef7557592)
so that he can make his benchmarking script available to me.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/vlad17/spark GBT-1
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/14547.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #14547
----
commit 6c7c60b581464be13b44aa43d2c402501fdb0505
Author: Vladimir Feinberg <[email protected]>
Date: 2016-07-22T01:01:58Z
Added new documentation for TreeBoost, top-level calls
commit a4c050675bc524b742cb9fc3703ce5105cabdd8a
Author: Vladimir Feinberg <[email protected]>
Date: 2016-07-22T19:55:10Z
Implemented ApproxBernoulliImpurity
commit 5a38e0c1b284423f3129c4edbacece562fb675a3
Author: Vladimir Feinberg <[email protected]>
Date: 2016-07-25T22:59:19Z
Added approximate Bernoulli impurity (L_2 treeboost)
commit 759d1aa1a20c1679fba212c3017e200d386fa6da
Author: Vladimir Feinberg <[email protected]>
Date: 2016-07-26T00:21:22Z
Added marker saying Laplace Impurity is not yet supported (requires
internal API change)
commit e027d6dedd928e96dc7c99dc699d9f7c374034a3
Author: Vladimir Feinberg <[email protected]>
Date: 2016-07-26T00:26:29Z
Updated docs to reflect lack of L1 impurity support
commit 15575a13c0ad4f2567bcccdcbcb134a9ca548d9c
Author: Vladimir Feinberg <[email protected]>
Date: 2016-07-26T00:41:00Z
Fixed urls
commit 7c7d804dc3c614984e863aae9ef8ffc8f9ec3117
Author: Vladimir Feinberg <[email protected]>
Date: 2016-07-26T00:43:46Z
Removed ApproxLaplaceImpurity
commit 44a58efe4b0b1bd69eaadc5dc17676194b949888
Author: Vladimir Feinberg <[email protected]>
Date: 2016-07-26T00:50:50Z
Fix reader docs
commit b362c3852c0e17783b08a9c9a97e1abb66ef5c9f
Author: Vladimir Feinberg <[email protected]>
Date: 2016-07-26T23:43:41Z
Fixed a bunch of bugs + tested wrt old behavior
commit f31903c228c164313c2f0cb22fac8b81effff6a1
Author: Vladimir Feinberg <[email protected]>
Date: 2016-07-27T00:47:51Z
Completed tests for reading/writing new impurities
commit 01eae2ae967fdbe89b0ecd440216e54431d51d3d
Author: Vladimir Feinberg <[email protected]>
Date: 2016-07-27T17:15:05Z
Finished tests
commit bd189e2aae27266314b16f0dffc3ce7a230d4e27
Author: Vladimir Feinberg <[email protected]>
Date: 2016-08-06T23:16:18Z
Added R's gbm as a direct comparison to GBTClassifier
commit 704864354619581f1f5bb43489c5e2ee9ec89487
Author: Vladimir Feinberg <[email protected]>
Date: 2016-08-07T00:20:35Z
Got rid of direct R comparison
commit a0a8fcddefa122682c579b567524cbcf2b00251c
Author: Vladimir Feinberg <[email protected]>
Date: 2016-08-08T06:18:14Z
Direct behavior-checking test (for GBTClassifier)
commit c050586e7db6eed41f5b8ddf1e245b13be2c8994
Author: Vladimir Feinberg <[email protected]>
Date: 2016-08-08T20:44:42Z
Added analogous test for GBTRegressor
commit 7e39ada3acf431c171adfca0603279002ff20153
Author: Vladimir Feinberg <[email protected]>
Date: 2016-08-08T21:03:47Z
Cleaned up style-related things
----