[GitHub] spark pull request #13440: [SPARK-15699] [ML] Implement a Chi-Squared test s...

erikerlandson Mon, 17 Sep 2018 15:18:02 -0700

Github user erikerlandson commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13440#discussion_r218245670
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -670,14 +670,32 @@ private[spark] object RandomForest extends Logging {
         val leftImpurity = leftImpurityCalculator.calculate() // Note: This equals 0 if count = 0
         val rightImpurity = rightImpurityCalculator.calculate()
     
    -    val leftWeight = leftCount / totalCount.toDouble
    -    val rightWeight = rightCount / totalCount.toDouble
    +    val gain = metadata.impurity match {
    +      case imp if (imp.isTestStatistic) =>
    +        // For split quality measures based on a test-statistic, run the test on the
    +        // left and right sub-populations to get a p-value for the null hypothesis
    +        val pval = imp.calculate(leftImpurityCalculator, rightImpurityCalculator)
    +        // Transform the test statistic p-val into a larger-is-better gain value
    +        Impurity.pValToGain(pval)
    +
    +      case _ =>
    +        // Default purity-gain logic:
    +        // measure the weighted decrease in impurity from parent to the left and right
    +        val leftWeight = leftCount / totalCount.toDouble
    +        val rightWeight = rightCount / totalCount.toDouble
    +
    +        impurity - leftWeight * leftImpurity - rightWeight * rightImpurity
    +    }
     
    -    val gain = impurity - leftWeight * leftImpurity - rightWeight * rightImpurity
    +    // If the impurity being used is a test statistic p-val, apply a standard transform into
    +    // a larger-is-better gain value for the minimum-gain threshold
    +    val minGain =
    +      if (metadata.impurity.isTestStatistic) Impurity.pValToGain(metadata.minInfoGain)
    +      else metadata.minInfoGain
    --- End diff --
    
    The main issue I recall was that all of the existing metrics assume some 
kind of "larger is better" gain, and p-values are "smaller is better."  I'll 
take another pass over it and see if I can push that distinction down so it 
doesn't require exposing new methods.
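
To make the "larger is better" versus "smaller is better" mismatch concrete, here is a minimal sketch of one way a p-value could be mapped onto a gain scale. The body of `Impurity.pValToGain` is not shown in this diff, so the negative-log transform, the `MinPVal` floor, and the `passes` helper below are all assumptions made for illustration and may differ from what the PR actually does.

```scala
// Hypothetical sketch only: the PR's real Impurity.pValToGain is not shown
// in the diff above and may use a different transform.
object PValGainSketch {
  // Assumed floor so that -log(p) stays finite if a p-value underflows to 0.
  val MinPVal: Double = 1e-300

  // Map a p-value in [0, 1] to a finite gain where smaller p means larger gain.
  def pValToGain(pval: Double): Double = {
    require(pval >= 0.0 && pval <= 1.0, s"p-value out of range: $pval")
    -math.log(math.max(pval, MinPVal))
  }

  // Hypothetical helper: a split passes when its gain reaches the transformed
  // threshold, which is the same condition as pval <= maxPVal.
  def passes(pval: Double, maxPVal: Double): Boolean =
    pValToGain(pval) >= pValToGain(maxPVal)
}
```

Because the transform is monotonically decreasing, applying it to both the split's p-value and the user-supplied threshold (as the diff does with `metadata.minInfoGain`) lets the existing "gain >= minGain" comparison stay unchanged.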


