GitHub user erikerlandson commented on a diff in the pull request:
https://github.com/apache/spark/pull/13440#discussion_r218245670
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -670,14 +670,32 @@ private[spark] object RandomForest extends Logging {
val leftImpurity = leftImpurityCalculator.calculate() // Note: This equals 0 if count = 0
val rightImpurity = rightImpurityCalculator.calculate()
- val leftWeight = leftCount / totalCount.toDouble
- val rightWeight = rightCount / totalCount.toDouble
+ val gain = metadata.impurity match {
+ case imp if (imp.isTestStatistic) =>
+ // For split quality measures based on a test-statistic, run the test on the
+ // left and right sub-populations to get a p-value for the null hypothesis
+ val pval = imp.calculate(leftImpurityCalculator, rightImpurityCalculator)
+ // Transform the test statistic p-val into a larger-is-better gain value
+ Impurity.pValToGain(pval)
+
+ case _ =>
+ // Default purity-gain logic:
+ // measure the weighted decrease in impurity from parent to the left and right
+ val leftWeight = leftCount / totalCount.toDouble
+ val rightWeight = rightCount / totalCount.toDouble
+
+ impurity - leftWeight * leftImpurity - rightWeight * rightImpurity
+ }
- val gain = impurity - leftWeight * leftImpurity - rightWeight * rightImpurity
+ // If the impurity being used is a test statistic p-val, apply a standard transform into
+ // a larger-is-better gain value for the minimum-gain threshold
+ val minGain =
+ if (metadata.impurity.isTestStatistic) Impurity.pValToGain(metadata.minInfoGain)
+ else metadata.minInfoGain
--- End diff --
The main issue I recall was that all of the existing metrics assume some
kind of "larger is better" gain, and p-values are "smaller is better." I'll
take another pass over it and see if I can push that distinction down so it
doesn't require exposing new methods.
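
To make the "larger is better" versus "smaller is better" distinction concrete, here is a minimal sketch of one way a p-value could be mapped onto a gain scale. It is not the code from the pull request: the object name `PValGainSketch`, the floor `MinPVal`, the helper `passesMinGain`, and the choice of a negative-log transform are all assumptions made for illustration. Any strictly decreasing transform would preserve the ordering in the same way.

```scala
// Hypothetical sketch; names and the transform are assumptions, not Spark's API.
object PValGainSketch {
  // Assumed floor so that log(0) is never evaluated; the exact value is illustrative.
  private val MinPVal = 1e-20

  // Strictly decreasing on [MinPVal, 1]: a smaller p-value yields a larger gain.
  def pValToGain(pval: Double): Double = -math.log(math.max(pval, MinPVal))

  // Applying the same transform to the candidate split and to the user threshold
  // turns "pval <= maxPVal" into the usual "gain >= minGain" comparison.
  def passesMinGain(pval: Double, maxPVal: Double): Boolean =
    pValToGain(pval) >= pValToGain(maxPVal)
}
```

Because the transform is applied to both sides of the comparison, the existing minimum-gain check keeps its shape, which is the property the diff above relies on when it converts `metadata.minInfoGain`.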