Github user sethah commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13650#discussion_r71028688
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/RandomForestRegressor.scala ---
    @@ -168,15 +173,37 @@ class RandomForestRegressionModel private[ml] (
       // Note: We may add support for weights (based on tree performance) later on.
       private lazy val _treeWeights: Array[Double] = Array.fill[Double](_trees.length)(1.0)
     
    +  @Since("2.1.0")
    +  /** @group getParam */
    +  def setVarianceCol(value: String): this.type = set(varianceCol, value)
    +
       @Since("1.4.0")
       override def treeWeights: Array[Double] = _treeWeights
     
    +  private def predictVariance(features: Vector): Double = {
    --- End diff --
    
    I took a closer look at the methodology in the paper. If you view the
prediction of a random forest as the mean of several random variables (the
predictions of the individual trees), then computing the variance of that
prediction would require knowledge of the covariances between those random
variables, which we don't track. The methodology in the paper gets around this
by assuming that the random forest predicts by selecting a single tree at
random and using that tree's prediction as the overall prediction, in which
case the prediction variance is just the variance of the per-tree predictions.
Since our prediction is the mean over all trees rather than the output of one
sampled tree, AFAICT the paper's methodology does not apply here. It might be
"better than nothing", but then again it might be worse if it misleads users.
I'm curious to hear others' thoughts on this.
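
    For concreteness, here is a minimal sketch of what the single-tree
assumption buys: if the forest's output is modeled as the prediction of one
tree chosen uniformly at random, the prediction variance collapses to the
variance of the per-tree predictions and no covariance terms are needed. The
object, method name, and signature below are hypothetical, not the PR's
implementation:

    object VarianceSketch {
      // treePredictions holds one prediction per tree for a single feature vector.
      def predictionVariance(treePredictions: Array[Double]): Double = {
        val n = treePredictions.length.toDouble
        val mean = treePredictions.sum / n
        // Population variance of the per-tree predictions; this equals the
        // prediction variance only under the single-randomly-selected-tree model.
        treePredictions.map(p => (p - mean) * (p - mean)).sum / n
      }
    }

    By contrast, for the mean-over-trees prediction we actually make,
Var((1/T) * sum_t f_t(x)) = (1/T^2) * (sum_t Var(f_t(x)) +
sum_{s != t} Cov(f_s(x), f_t(x))), and those covariance terms are exactly
what we do not track.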

