Github user erikerlandson commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13440#discussion_r218997600
  
    --- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/Gini.scala ---
    @@ -71,6 +71,23 @@ object Gini extends Impurity {
       @Since("1.1.0")
       def instance: this.type = this
     
    +  /**
    +   * :: DeveloperApi ::
    +   * p-values for test-statistic measures, unsupported for [[Gini]]
    +   */
    +  @Since("2.2.0")
    +  @DeveloperApi
    +  def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): 
Double =
    --- End diff --
    
    @srowen @willb 
    I cached the design of the metrics back in. In general, `Impurity` already 
uses methods that are only defined on certain impurity sub-classes, and so this 
new method does not change that situation.
    
    My take on the "problem" is that the existing measures are all based on a 
localized concept of "purity" (or impurity) that can be calculated using _only_ 
the data at a single node. Splitting based on statistical tests (p-values) 
breaks that model, since it is making use of a more generalized concept of 
split quality that requires the sample populations of both children from a 
candidate split. A maximally general signature would probably involve the 
parent and both children.
    
    Another kink in the current design is that `ImpurityCalculator` is 
essentially parallel to `Impurity`, and in fact 
`ImpurityCalculator#calculate()` is how impurity measures are currently 
requested. `Impurity` seems somewhat redundant, and might be factored out in 
favor of `ImpurityCalculator`. The current signature `calculate()` might be 
generalized into a more inclusive concept of split quality that expects to make 
use of {parent,left,right}.
    
    Calls to `calculate()` are not very wide-spread but threading that change 
through is outside the scope of this particular PR. If people are interested in 
that kind of refactoring I could look into it in the near future but probably 
not in the next couple weeks.
    
    That kind of change would also be API breaking and so a good target for 3.0


---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to