Github user erikerlandson commented on a diff in the pull request:
https://github.com/apache/spark/pull/13440#discussion_r218997600
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Gini.scala ---
@@ -71,6 +71,23 @@ object Gini extends Impurity {
@Since("1.1.0")
def instance: this.type = this
+ /**
+ * :: DeveloperApi ::
+ * p-values for test-statistic measures, unsupported for [[Gini]]
+ */
+ @Since("2.2.0")
+ @DeveloperApi
+ def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator):
Double =
--- End diff --
@srowen @willb
I cached the design of the metrics back in. In general, `Impurity` already
uses methods that are only defined on certain impurity sub-classes, and so this
new method does not change that situation.
My take on the "problem" is that the existing measures are all based on a
localized concept of "purity" (or impurity) that can be calculated using _only_
the data at a single node. Splitting based on statistical tests (p-values)
breaks that model, since it is making use of a more generalized concept of
split quality that requires the sample populations of both children from a
candidate split. A maximally general signature would probably involve the
parent and both children.
Another kink in the current design is that `ImpurityCalculator` is
essentially parallel to `Impurity`, and in fact
`ImpurityCalculator#calculate()` is how impurity measures are currently
requested. `Impurity` seems somewhat redundant, and might be factored out in
favor of `ImpurityCalculator`. The current signature `calculate()` might be
generalized into a more inclusive concept of split quality that expects to make
use of {parent,left,right}.
Calls to `calculate()` are not very wide-spread but threading that change
through is outside the scope of this particular PR. If people are interested in
that kind of refactoring I could look into it in the near future but probably
not in the next couple weeks.
That kind of change would also be API breaking and so a good target for 3.0
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]