[ https://issues.apache.org/jira/browse/SPARK-4240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15366502#comment-15366502 ]

Vladimir Feinberg edited comment on SPARK-4240 at 7/22/16 4:47 PM:
-------------------------------------------------------------------

Pending some dramatic response from [~sethah] telling me to back off, I'll take 
over this one. [~josephkb], mind reviewing the below outline?

I propose that this JIRA be resolved in the following manner:
API Change: Since a true "TreeBoost" chooses splits by the reduction in the actual loss, rather than by a fixed, loss-agnostic impurity, the impurity calculator should be derived from the loss function itself.
 * Set a new default of 'auto' for the impurity param in GBTs, which selects the loss-based impurity, but can be overridden (e.g. back to 'variance') to keep the standard RF-style splitting if desired.
 * Create a generic loss-reduction calculator which works by reducing a parametrizable loss criterion (or, rather, a Taylor approximation of it, as recommended by Friedman \[1\] and implemented to the second order by XGBoost \[2\] \[code: 5\]); a sketch of the idea follows this list.
 * Instantiate the loss-reduction calculator for regression:
 ** Add squared and absolute losses
 ** 'loss-based' induces a second-order approximation for squared loss (where the expansion is exact) and only a first-order approximation for absolute loss
 ** The former should perform like LS_Boost from \[1\], and the latter is sort-of (*) equivalent to LAD_TreeBoost from \[1\]. Both of these "generic loss" instantiations become new impurities that the user could set, just like 'gini' or 'entropy'. The calculator will also implement the corresponding terminal-leaf predictions, either the mean or the median of the leaf's sample. Computing the median may require modifications to the internal developer API so that, at some point, the calculator can access the entire set of training samples that a terminal node's partition corresponds to.
 * On the classifier side, we need to do the same thing, with a logistic loss inducing a new impurity. A second-order approximation is again feasible here; the first-order version corresponds, sort-of (*), to L2_TreeBoost from \[1\].
 * Because the new impurities apply only to GBTs, they'll only be available for 
them.
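
To make that calculator concrete, here is a minimal, self-contained sketch of the loss-based split statistics. This is *not* the existing internal ImpurityCalculator/ImpurityAggregator developer API; every name below is made up for illustration. Following \[2\], the idea is to accumulate per-sample gradients and hessians of the loss at the current ensemble prediction and score splits by the second-order loss reduction:

{code:scala}
// Hypothetical sketch only -- not the MLlib developer API.
// gradSum (G) and hessSum (H) are sums of per-sample loss derivatives, taken at the
// current ensemble prediction, over all samples reaching a node.
final case class GradStats(gradSum: Double, hessSum: Double, count: Long) {

  def merge(other: GradStats): GradStats =
    GradStats(gradSum + other.gradSum, hessSum + other.hessSum, count + other.count)

  /** Second-order estimate of the loss reduction achievable at this node: G^2 / (2H). */
  def score: Double =
    if (hessSum > 0.0) gradSum * gradSum / (2.0 * hessSum) else 0.0

  /** Newton-step leaf value, -G / H (shrinkage would then be applied on top). */
  def leafValue: Double =
    if (hessSum > 0.0) -gradSum / hessSum else 0.0
}

object LossBasedImpurity {
  /** Gain of splitting a node into (left, right); the candidate split with the largest gain wins. */
  def splitGain(left: GradStats, right: GradStats): Double =
    left.score + right.score - left.merge(right).score
}
{code}

From the user's point of view nothing changes except the new impurity values, i.e. roughly {{new GBTRegressor().setLossType("absolute").setImpurity("auto")}}, with {{impurity="variance"}} preserving today's behavior.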

(*) A note regarding the sort-of equivalence with Friedman: in his 1999 paper, Friedman admits that he's not doing "true" TreeBoost because he builds the tree based on variance reduction of the residuals. This is exactly what \[3\] does. \[2\] instead builds the tree by optimizing a _Taylor approximation_ of the losses, which makes it feasible to efficiently consider many splits in a leaf (because of the additive nature of the approximate loss function); the expansion is spelled out below.
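
For reference, the expansion in question (my notation, essentially the unregularized objective from \[2\]): with g and h the first and second derivatives of the loss L(y, f) with respect to the current prediction f,

{noformat}
L(y, f + w) ~= L(y, f) + g*w + (1/2)*h*w^2

best constant adjustment for a node:  w* = -G/H,  where G = sum(g_i), H = sum(h_i)
approximate loss reduction:           G^2 / (2H)
gain of a split:                      G_L^2/(2 H_L) + G_R^2/(2 H_R) - (G_L + G_R)^2 / (2 (H_L + H_R))
{noformat}

Since G and H are plain sums, they can be accumulated per bin in one pass over a node's samples, which is exactly what makes evaluating many candidate splits cheap.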

* For logistic loss, this works really well for XGBoost.
* For squared error, both approaches are equivalent.
* For absolute error, the Taylor approximation can only be first-order (though locally it's a perfect approximation). I don't think anyone has done even this approximate version of "true" L1 TreeBoost before. It may be necessary to go the way gbm does and use the variance impurity, but we'll try it out anyway. The per-loss gradients and hessians are sketched below.
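
As a concrete (and hedged) illustration of the first- vs. second-order distinction above, these are the per-sample (gradient, hessian) pairs the loss-derived impurity would accumulate. The function names are mine, not an existing API; I'm using the L = (1/2)(f - y)^2 convention for squared loss and 0/1 labels for logistic loss:

{code:scala}
object LossDerivatives {
  // y = label, f = current ensemble prediction (a margin for classification).

  def squaredLoss(y: Double, f: Double): (Double, Double) =
    (f - y, 1.0)                          // second-order expansion is exact

  def absoluteLoss(y: Double, f: Double): (Double, Double) =
    (math.signum(f - y), 0.0)             // no curvature: first-order only; the leaf falls back to the median

  def logisticLoss(y: Double, f: Double): (Double, Double) = {
    val p = 1.0 / (1.0 + math.exp(-f))    // predicted probability
    (p - y, p * (1.0 - p))                // the same g/h pair xgboost uses for binary logistic loss
  }
}
{code}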

Note that the L2 loss corresponds to Gaussian regression, L1 to Laplace, and logistic to Bernoulli; I'll add these as aliases for the loss param.

Differences between this and \[2\]:
* No leaf weight regularization, besides the default constant shrinkage, is 
implemented.

Differences between this and \[3\]:
* \[3\] uses variance impurity for split selection \[code: 6\]. I don't think 
this is even technically TreeBoost. Such behavior should be emulatable in the 
new code by overriding impurity='variance' (would be nice to see if we have 
comparable perf here).
* \[3\] implements GBTs for weighted input data. We don't support data weights, so for both the L1 and L2 losses the terminal-node computations don't need Newton-Raphson optimization (closed forms are sketched below).
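
To spell out why Newton-Raphson isn't needed in the unweighted case (a sketch; this helper is hypothetical, not the current developer API): with equal sample weights, the loss-minimizing constant for a leaf has a closed form.

{code:scala}
object LeafValues {
  // Closed-form terminal-leaf values over the residuals (y_i - f_i) of the samples reaching a leaf.
  def leafValue(loss: String, residuals: Array[Double]): Double = loss match {
    case "squared"  => residuals.sum / residuals.length             // the mean minimizes squared error
    case "absolute" => residuals.sorted.apply(residuals.length / 2) // the median minimizes absolute error
    case other      => sys.error(s"no closed form sketched for loss: $other")
  }
}
{code}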

Probably not for this JIRA:
1. Implementing leaf weights (and leaf weight regularization) - probably 
involves adding a regularization param to GBTs, creating new 
regularization-aware impurity calculators.
2. In {{RandomForest.scala}}, the line {{val requiredSamples = math.max(metadata.maxBins * metadata.maxBins, 10000)}} performs row subsampling on our data. I don't know whether it's sound from a statistical-learning perspective, but it's something we should take a look at (i.e., does performing a precise sample-complexity calculation in the PAC sense lead to better perf?). The surrounding code is paraphrased after this list.
3. Add different "losses" corresponding to residual distributions - see all the 
ones supported here \[3\] \[4\] \[7\]. Depending what we add, we may need to 
implement NR optimization. Huber loss is the only one mentioned in \[1\] that 
we don't yet have.
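
For context on item 2 (paraphrased from memory of {{RandomForest.findSplits}}; the exact code may differ), the subsample is only used to choose candidate split thresholds/bins for continuous features, not to train the trees themselves:

{code:scala}
// Roughly the surrounding logic (paraphrased, not verbatim):
val requiredSamples = math.max(metadata.maxBins * metadata.maxBins, 10000)
val fraction =
  if (requiredSamples < metadata.numExamples) requiredSamples.toDouble / metadata.numExamples
  else 1.0
// Only this subsample feeds the candidate-split computation.
val sampledInput = input.sample(withReplacement = false, fraction, seed)
{code}

So the question is whether maxBins^2 (with a floor of 10000) is a defensible sample size for estimating those bin boundaries.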

\[1\] Friedman paper: https://statweb.stanford.edu/~jhf/ftp/trebst.pdf
\[2\] xgboost paper: http://www.kdd.org/kdd2016/papers/files/Paper_697.pdf
\[3\] gbm impl paper: http://www.saedsayad.com/docs/gbm2.pdf
\[4\] xgboost docs: 
https://xgboost.readthedocs.io/en/latest//parameter.html#general-parameters
\[5\] xgboost regression objective code: 
https://github.com/dmlc/xgboost/blob/1625dab1cbc416d9d9a79dde141e3d236060387a/src/objective/regression_obj.cc
\[6\] https://github.com/gbm-developers/gbm/blob/master/src/node_parameters.h
\[7\] gbm api: https://cran.r-project.org/web/packages/gbm/gbm.pdf




> Refine Tree Predictions in Gradient Boosting to Improve Prediction Accuracy.
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-4240
>                 URL: https://issues.apache.org/jira/browse/SPARK-4240
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>    Affects Versions: 1.3.0
>            Reporter: Sung Chung
>
> The gradient boosting as currently implemented estimates the loss-gradient in 
> each iteration using regression trees. At every iteration, the regression 
> trees are trained/split to minimize predicted gradient variance. 
> Additionally, the terminal node predictions are computed to minimize the 
> prediction variance.
> However, such predictions won't be optimal for loss functions other than the 
> mean-squared error. The TreeBoosting refinement can help mitigate this issue 
> by modifying terminal node prediction values so that those predictions would 
> directly minimize the actual loss function. Although this still doesn't 
> change the fact that the tree splits were done through variance reduction, it 
> should still lead to improvement in gradient estimations, and thus better 
> performance.
> The details of this can be found in the R vignette. This paper also shows how 
> to refine the terminal node predictions.
> http://www.saedsayad.com/docs/gbm2.pdf


