[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost

vlad17 Fri, 19 Aug 2016 16:31:21 -0700

Github user vlad17 commented on the issue:

    https://github.com/apache/spark/pull/14547
  
    @sethah Was that coupling not already there beforehand? I didn't really 
change any of the implementation class' interfaces, I just added the Bernoulli 
impurity to the existing Impurity framework, which itself couples the Impurity 
class with the ImpurityAggregator, that necessarily returns an 
ImpurityCalculator, which makes the prediction. It seems like the existing 
design is already doing the coupling.
    
    As for the interface making this coupling explicit, yes, I completely agree 
I'm doing that. But I think this is a good thing.
    1. The coupled loss functions / splitting impurity is the whole point of 
tree boost. The papers themselves say to construct intermediate trees to 
minimize loss. They only offer using other impurity measures for ease of 
implementation. XGBoost, for instance, splits on (an approximation of) the 
losses directly.
    2. The fact that the underlying impurity/predictions are all done by the 
same class (though not my choice), is also probably better from an 
implementation perspective. Both need to gather summary statistics about each 
leaf's partition of the data, so it's easiest to just do it in one place.
    3. I don't think we're giving up the "Decoupled version" either. If we so 
choose to in the future, setting impurity to "variance" but loss function to 
"absolute" can use a new ImpurityAggregator that offers the variance for 
splitting but median for predicting.
    
    My goal with this PR was to make as minimal a change as possible (it's 
mostly an API change introducing the loss-based impurity, which also makes 
loss-based terminal node predictions). I'm not trying to change the GBT design 
here at all (though if it appears to be the case because of something I'm 
missing, please let me know).




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] spark issue #14547: [SPARK-16718][MLlib] gbm-style treeboost

Reply via email to