Github user vlad17 commented on the issue:
https://github.com/apache/spark/pull/14547
@sethah Was that coupling not already there beforehand? I didn't really
change any of the implementation class' interfaces, I just added the Bernoulli
impurity to the existing Impurity framework, which itself couples the Impurity
class with the ImpurityAggregator, that necessarily returns an
ImpurityCalculator, which makes the prediction. It seems like the existing
design is already doing the coupling.
As for the interface making this coupling explicit, yes, I completely agree
I'm doing that. But I think this is a good thing.
1. The coupled loss functions / splitting impurity is the whole point of
tree boost. The papers themselves say to construct intermediate trees to
minimize loss. They only offer using other impurity measures for ease of
implementation. XGBoost, for instance, splits on (an approximation of) the
losses directly.
2. The fact that the underlying impurity/predictions are all done by the
same class (though not my choice), is also probably better from an
implementation perspective. Both need to gather summary statistics about each
leaf's partition of the data, so it's easiest to just do it in one place.
3. I don't think we're giving up the "Decoupled version" either. If we so
choose to in the future, setting impurity to "variance" but loss function to
"absolute" can use a new ImpurityAggregator that offers the variance for
splitting but median for predicting.
My goal with this PR was to make as minimal a change as possible (it's
mostly an API change introducing the loss-based impurity, which also makes
loss-based terminal node predictions). I'm not trying to change the GBT design
here at all (though if it appears to be the case because of something I'm
missing, please let me know).
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]