Github user MechCoder commented on the issue:
https://github.com/apache/spark/pull/13650
@yanboliang Sorry for the long delay! Hope you are still here.
1. The term "variance in predictions" is ambiguous and a bit misleading.
Given the original data-generating distribution, the variance in prediction
for a decision tree describes how much the prediction changes from one
decision tree to another, each fit on a different subsample of the data. As we
know, this "variance in predictions" is high for a single decision tree and
shrinks towards zero for a random forest (assuming there is a huge number of
trees and the trees are uncorrelated). I have updated the PR title to reflect
this.
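
To make that notion of variance concrete, here is a minimal sketch in plain
Python. Everything in it is an assumption made for illustration: the toy data
(y = x + noise), the depth-1 "trees" (`fit_stump`), the subsample size and the
query point `x0`. It is not the code from this PR. It fits many stumps on
random subsamples of one fixed data pool and compares how much a single
stump's prediction varies with how much the average of 20 stumps varies.

```python
import random
import statistics

def fit_stump(xs, ys):
    """Fit a depth-1 regression tree (one split) by minimising squared error."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    total = sum(y for _, y in pairs)
    total_sq = sum(y * y for _, y in pairs)
    left_sum = left_sq = 0.0
    best = None
    for k in range(1, n):
        y_prev = pairs[k - 1][1]
        left_sum += y_prev
        left_sq += y_prev * y_prev
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # cannot split between identical x values
        right_sum = total - left_sum
        right_sq = total_sq - left_sq
        # Sum of squared errors around each side's mean.
        sse = (left_sq - left_sum ** 2 / k) + (right_sq - right_sum ** 2 / (n - k))
        if best is None or sse < best[0]:
            threshold = (pairs[k - 1][0] + pairs[k][0]) / 2
            best = (sse, threshold, left_sum / k, right_sum / (n - k))
    if best is None:  # all x identical: predict the overall mean
        mean = total / n
        return lambda x: mean
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

rng = random.Random(0)
# Toy data pool (assumption): y = x + Gaussian noise.
pool_x = [rng.random() for _ in range(200)]
pool_y = [x + rng.gauss(0.0, 0.3) for x in pool_x]

def tree_prediction(x0, subsample_size=30):
    """Prediction at x0 of one stump fit on a random subsample of the pool."""
    idx = rng.sample(range(len(pool_x)), subsample_size)
    stump = fit_stump([pool_x[i] for i in idx], [pool_y[i] for i in idx])
    return stump(x0)

n_forests, n_trees, x0 = 100, 20, 0.5
per_tree = [[tree_prediction(x0) for _ in range(n_trees)] for _ in range(n_forests)]

# Variance of a single tree's prediction across subsamples.
tree_var = statistics.pvariance([p for forest in per_tree for p in forest])
# Variance of the averaged (forest) prediction across forests.
forest_var = statistics.pvariance([statistics.fmean(forest) for forest in per_tree])

print(f"single tree variance: {tree_var:.4f}")
print(f"forest variance:      {forest_var:.4f}")
```

The forest variance comes out smaller than the single-tree variance, but it
does not reach zero here, because stumps drawn from the same pool are
correlated and there are only 20 of them.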
2. No, the paper as such is not widely cited. Also, what @sethah describes
is correct. This approach picks a random tree with equal probability and uses
the variance obtained that way. However, the conditional distribution of Y | x
is NOT the mean of the conditional distributions of Y | x of the individual
trees. That is, P(Y | x) != (P(Y_1 | x) + P(Y_2 | x) + ... + P(Y_n | x)) / n.
It is only the expectation of Y | x that is given by the mean of the
expectations of the individual trees. The correct way of deriving the
conditional CDF of Y | x is given in the well-cited paper
(http://www.jmlr.org/papers/volume7/meinshausen06a/meinshausen06a.pdf).
3. However, the formula derived in the paper is the same as the one obtained
from the weighted variance, with the weights on the target values in the
training data defined as in formula 5 of
http://www.jmlr.org/papers/volume7/meinshausen06a/meinshausen06a.pdf . I have
verified this on synthetic data in a notebook
(https://github.com/MechCoder/Notebooks/blob/master/Conditional_variances.ipynb).
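
The check can be sketched in a few lines of plain Python. This is not the
notebook's code, and it rests on assumptions: that the formula in question is
the equal-probability mixture over trees (mean of the per-leaf variances plus
the variance of the per-leaf means), and that each training point's weight is
its share of the leaf containing x, averaged over trees. The `leaves` lists
are hypothetical stand-ins for "training points that share a leaf with x" and
are drawn at random rather than taken from fitted trees.

```python
import random
import statistics

rng = random.Random(42)
y = [rng.gauss(0.0, 1.0) for _ in range(50)]  # synthetic training targets
n_trees = 10

# Hypothetical leaf membership: for each tree, the indices of the training
# points that fall in the same leaf as the query point x.
leaves = [rng.sample(range(len(y)), rng.randint(3, 8)) for _ in range(n_trees)]

# (a) Pick a tree uniformly at random, then apply the law of total variance:
#     mean of the leaf variances + variance of the leaf means.
leaf_means = [statistics.fmean(y[i] for i in leaf) for leaf in leaves]
leaf_vars = [statistics.pvariance([y[i] for i in leaf]) for leaf in leaves]
mixture_var = statistics.fmean(leaf_vars) + statistics.pvariance(leaf_means)

# (b) Weighted variance: each training point gets weight 1 / (leaf size) in
#     every tree whose leaf contains it, averaged over the trees.
weights = [0.0] * len(y)
for leaf in leaves:
    for i in leaf:
        weights[i] += 1.0 / (len(leaf) * n_trees)

weighted_mean = sum(w * yi for w, yi in zip(weights, y))
weighted_var = sum(w * (yi - weighted_mean) ** 2 for w, yi in zip(weights, y))

print(f"mixture variance:  {mixture_var:.6f}")
print(f"weighted variance: {weighted_var:.6f}")
```

Under these assumptions the two numbers agree up to floating-point error,
since both equal the weighted second moment of the targets minus the square
of the weighted mean.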
I have spent more time than I initially expected on this Pull Request, and
I'm willing to do whatever else is required to get it merged.