GitHub user vlad17 opened a pull request:

    https://github.com/apache/spark/pull/14547

    [SPARK-16718][MLlib] gbm-style treeboost [WIP]

    ## What changes were proposed in this pull request?
    
    This change adds TreeBoost functionality to `GBTClassifier` and 
`GBTRegressor`. The main change is that leaf nodes now make a prediction which 
optimizes the loss function, rather than always using the mean label (which is 
only optimal in the case of variance-based impurity).
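
To illustrate the difference between a mean-label leaf and a loss-optimizing leaf, here is a minimal sketch in plain Python. It is not the code in this pull request. It assumes Friedman's TreeBoost formulation, with labels in {-1, +1} and loss `log(1 + exp(-2*y*F))`, and approximates the logistic leaf value with a single Newton step. All function names are hypothetical.

```python
import math


def l2_leaf(residuals):
    """Squared-error loss: the loss-minimizing leaf value is the mean residual."""
    return sum(residuals) / len(residuals)


def logistic_leaf(labels, margins):
    """One Newton step toward the leaf value that minimizes logistic loss.

    Assumes labels y in {-1, +1} and current model margins F, with
    loss(y, F) = log(1 + exp(-2 * y * F)).
    """
    # Negative gradient of the loss with respect to F (the pseudo-residual).
    grads = [2.0 * y / (1.0 + math.exp(2.0 * y * f))
             for y, f in zip(labels, margins)]
    # Second derivative of the loss, written in terms of the pseudo-residual.
    hess = [abs(r) * (2.0 - abs(r)) for r in grads]
    return sum(grads) / sum(hess)


def logistic_loss(labels, margins, shift=0.0):
    """Total logistic loss after adding a constant leaf value to every margin."""
    return sum(math.log1p(math.exp(-2.0 * y * (f + shift)))
               for y, f in zip(labels, margins))


labels = [1, 1, 1, -1]
margins = [1.0, 1.0, 1.0, 1.0]
grads = [2.0 * y / (1.0 + math.exp(2.0 * y * f)) for y, f in zip(labels, margins)]
print("mean of pseudo-residuals:", l2_leaf(grads))
print("Newton-step leaf value:  ", logistic_leaf(labels, margins))
```

With nonzero margins the two leaf values differ, which is why a variance-based leaf is not optimal for logistic loss.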
    
    This changes the default impurity to the loss-based impurity rather than 
variance, which was previously required.
    
    I made this change only for L2 loss and logistic loss (also adding aliases 
for the loss names, for parity with R's implementation, GBM). For these two 
loss functions, the leaf predictions can be computed within the framework of 
the current impurity API. Other loss functions will require API modification, 
which should be its own change, SPARK-16728.
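
One way to see why some losses fit more easily than others: a leaf value that is a ratio of sums can be built from per-partition statistics that merge by addition, whereas the L1-optimal leaf value (the median of the residuals) cannot. The sketch below assumes, as a simplification, that an impurity aggregator only keeps additive statistics. `LeafStats` is a hypothetical name and not a class from Spark.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class LeafStats:
    """Additive statistics for one leaf (hypothetical, for illustration only)."""
    count: int = 0
    total: float = 0.0

    @staticmethod
    def from_values(values):
        return LeafStats(count=len(values), total=sum(values))

    def merge(self, other):
        # Statistics from different data partitions combine by simple addition.
        return LeafStats(self.count + other.count, self.total + other.total)

    def predict(self):
        # L2-optimal leaf value: the mean, recovered from the merged sums.
        return self.total / self.count


partitions = [[1.0, 2.0, 9.0], [3.0, 4.0]]
all_values = [v for part in partitions for v in part]

merged = LeafStats.from_values(partitions[0]).merge(
    LeafStats.from_values(partitions[1]))
mean_from_merged = merged.predict()

# The median does not decompose this way: combining per-partition medians
# generally gives a different answer than the median of all the data.
median_of_medians = median(median(part) for part in partitions)
true_median = median(all_values)

print(mean_from_merged, sum(all_values) / len(all_values))
print(median_of_medians, true_median)
```

Here the merged mean matches the global mean, while the combined per-partition medians do not match the global median.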
    
    Note that because loss-based impurity with L1 loss is NOT supported, code 
that sets L1 loss but leaves impurity at its default will now throw; such code 
must set impurity to variance explicitly.
    
    ## How was this patch tested?
    
    Unit testing for correctness: I tested the default parameter values and 
the new settings for the parameters.
    
    [WIP] For accuracy, I'm currently comparing the performance on a [real-life 
dataset](https://www.datarobot.com/blog/r-getting-started-with-data-science/) 
between Spark and GBM. I will upload the results once I have them.
    [WIP] This code shouldn't introduce any regressions, but it would be nice 
to make sure. I'm waiting for @sethah to respond on [his previous 
PR](https://github.com/apache/spark/commit/dafd70fbfe70702502ef198f2a8f529ef7557592)
 so that he can make his benchmarking script available to me.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/vlad17/spark GBT-1

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/14547.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #14547
    
----
commit 6c7c60b581464be13b44aa43d2c402501fdb0505
Author: Vladimir Feinberg <[email protected]>
Date:   2016-07-22T01:01:58Z

    Added new documentation for TreeBoost, top-level calls

commit a4c050675bc524b742cb9fc3703ce5105cabdd8a
Author: Vladimir Feinberg <[email protected]>
Date:   2016-07-22T19:55:10Z

    Implemented ApproxBernoulliImpurity

commit 5a38e0c1b284423f3129c4edbacece562fb675a3
Author: Vladimir Feinberg <[email protected]>
Date:   2016-07-25T22:59:19Z

    Added approximate Bernoulli impurity (L_2 treeboost)

commit 759d1aa1a20c1679fba212c3017e200d386fa6da
Author: Vladimir Feinberg <[email protected]>
Date:   2016-07-26T00:21:22Z

    Added marker saying Laplace Impurity is not yet supported (requires 
internal API change)

commit e027d6dedd928e96dc7c99dc699d9f7c374034a3
Author: Vladimir Feinberg <[email protected]>
Date:   2016-07-26T00:26:29Z

    Updated docs to reflect lack of L1 impurity support

commit 15575a13c0ad4f2567bcccdcbcb134a9ca548d9c
Author: Vladimir Feinberg <[email protected]>
Date:   2016-07-26T00:41:00Z

    Fixed urls

commit 7c7d804dc3c614984e863aae9ef8ffc8f9ec3117
Author: Vladimir Feinberg <[email protected]>
Date:   2016-07-26T00:43:46Z

    Removed ApproxLaplaceImpurity

commit 44a58efe4b0b1bd69eaadc5dc17676194b949888
Author: Vladimir Feinberg <[email protected]>
Date:   2016-07-26T00:50:50Z

    Fix reader docs

commit b362c3852c0e17783b08a9c9a97e1abb66ef5c9f
Author: Vladimir Feinberg <[email protected]>
Date:   2016-07-26T23:43:41Z

    Fixed a bunch of bugs + tested wrt old behavior

commit f31903c228c164313c2f0cb22fac8b81effff6a1
Author: Vladimir Feinberg <[email protected]>
Date:   2016-07-27T00:47:51Z

    Completed tests for reading/writing new impurities

commit 01eae2ae967fdbe89b0ecd440216e54431d51d3d
Author: Vladimir Feinberg <[email protected]>
Date:   2016-07-27T17:15:05Z

    Finished tests

commit bd189e2aae27266314b16f0dffc3ce7a230d4e27
Author: Vladimir Feinberg <[email protected]>
Date:   2016-08-06T23:16:18Z

    Added R's gbm as a direct comparison to GBTClassifier

commit 704864354619581f1f5bb43489c5e2ee9ec89487
Author: Vladimir Feinberg <[email protected]>
Date:   2016-08-07T00:20:35Z

    Got rid of direct R comparison

commit a0a8fcddefa122682c579b567524cbcf2b00251c
Author: Vladimir Feinberg <[email protected]>
Date:   2016-08-08T06:18:14Z

    Direct behavior-checking test (for GBTClassifier)

commit c050586e7db6eed41f5b8ddf1e245b13be2c8994
Author: Vladimir Feinberg <[email protected]>
Date:   2016-08-08T20:44:42Z

    Added analogous test for GBTRegressor

commit 7e39ada3acf431c171adfca0603279002ff20153
Author: Vladimir Feinberg <[email protected]>
Date:   2016-08-08T21:03:47Z

    Cleaned up style-related things

----


