Github user MechCoder commented on the issue:

    https://github.com/apache/spark/pull/13650
  
    @yanboliang Sorry for the long delay! Hope you are still here.
    
    1. The term "variance in predictions" is ambiguous and a bit misleading. 
Given the original data-generating distribution, the variance in prediction 
for a decision tree describes how much the prediction changes from one 
decision tree to another when each is fit on a different subsample of the 
data. As we know, this "variance in predictions" is high for a single decision 
tree and reduces to zero for a random forest (assuming there are a huge number 
of trees and the trees are uncorrelated). I have updated the PR title to 
reflect this.
    
    2. No, the paper as such is not widely cited. Also, what @sethah describes 
is correct. This approach picks a random tree with equal probability and 
uses the expected variance obtained that way. However, the conditional 
distribution of Y|x is NOT the mean of the conditional distributions of Y|x 
of the individual trees. 
That is, P(Y | x) != (P(Y_1 | x) + P(Y_2 | x) + ... + P(Y_n | x)) / n. It is 
only the expectation of Y|x that is given by the mean of the expectations of 
the individual trees. The correct way of deriving the conditional CDF of Y | x 
is given in the well-cited paper 
(http://www.jmlr.org/papers/volume7/meinshausen06a/meinshausen06a.pdf).
    
    3. However, the formula derived in the paper is the same as the one 
obtained from the weighted variance, with the weights given to the target 
variable in the training data as defined in formula 5 of 
http://www.jmlr.org/papers/volume7/meinshausen06a/meinshausen06a.pdf . I have 
verified it on synthetic data in a notebook 
(https://github.com/MechCoder/Notebooks/blob/master/Conditional_variances.ipynb).
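    
    The equivalence claimed in point 3 can be sketched in a few lines of NumPy. 
This is only an illustration under stated assumptions, not the code from the 
PR or the notebook: each tree is represented by the indices of the training 
samples that share a leaf with the query point x, and the per-sample weight is 
assumed to be the average over trees of 1 / (leaf size) for samples in that 
leaf. The names `mixture_variance`, `weighted_variance` and `leaf_members` are 
hypothetical.

```python
import numpy as np


def mixture_variance(y, leaf_members):
    """Variance of Y|x when one tree is picked uniformly at random.

    Law of total variance: mean of the per-tree leaf variances plus the
    variance of the per-tree leaf means.
    """
    means = np.array([y[idx].mean() for idx in leaf_members])
    variances = np.array([y[idx].var() for idx in leaf_members])
    return variances.mean() + (means ** 2).mean() - means.mean() ** 2


def weighted_variance(y, leaf_members):
    """Variance of Y|x from per-sample weights on the training targets.

    Assumed weight of sample i: average over trees of 1 / (leaf size) if
    sample i falls in the same leaf as x in that tree, else 0.
    """
    n_trees = len(leaf_members)
    weights = np.zeros(len(y))
    for idx in leaf_members:
        # idx must hold unique indices for this in-place update to be correct.
        weights[idx] += 1.0 / (len(idx) * n_trees)
    mean = np.dot(weights, y)
    return np.dot(weights, y ** 2) - mean ** 2


# Synthetic check: 20 hypothetical trees, each leaf holding 3 to 9 samples.
rng = np.random.default_rng(0)
y = rng.normal(size=50)
leaf_members = [
    rng.choice(50, size=int(rng.integers(3, 10)), replace=False)
    for _ in range(20)
]
assert np.isclose(mixture_variance(y, leaf_members),
                  weighted_variance(y, leaf_members))
```

    The two functions agree because the weighted second moment of y is the 
average over trees of the within-leaf second moment. The sketch ignores 
bootstrap resampling, so each tree is assumed to see the full training set.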
    
    I have spent more time than I initially expected on this Pull Request, 
and I'm willing to do whatever else is required to get it merged.

