Hi all,

I am trying to understand the differences between the feature importance plots produced by the R package gbm and by sklearn. Having compared both implementations side by side, the fitted models seem fairly similar, yet the feature importance plots are quite different.

R uses the empirical improvement in squared error, as described in Friedman's "Greedy Function Approximation" paper (eq. 44, 45).
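For reference, here is how I read those two equations (a tree T with J terminal nodes, hence J-1 internal nodes, and M trees in the ensemble):

    \hat{I}_j^2(T) = \sum_{t=1}^{J-1} \hat{i}_t^2 \, 1(v_t = j)        (44)
    \hat{I}_j^2 = \frac{1}{M} \sum_{m=1}^{M} \hat{I}_j^2(T_m)          (45)

where \hat{i}_t^2 is the empirical improvement in squared error from the split at internal node t and v_t is the variable split on at that node. As far as I can tell, this is what gbm's relative.influence / summary.gbm report, up to rescaling.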

sklearn (as far as I can tell from the code) uses the weighted decrease in node impurity. How exactly is this calculated? Is it a Gini index? Is there a reference?

I found this, but find it hard to follow:
https://github.com/scikit-learn/scikit-learn/blob/fc2f24927fc37d7e42917369f17de045b14c59b5/sklearn/tree/_tree.pyx#L1056
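For what it's worth, here is my reading of that Cython as a minimal NumPy sketch (the helper name tree_importances and the toy data are mine, and I am not certain whether each sklearn version normalizes the per-tree importances before or after averaging, so the final comparison may only be approximate). If I understand correctly, for the regression trees used by GradientBoosting the "impurity" is the (Friedman) MSE criterion rather than Gini:

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.ensemble import GradientBoostingRegressor

    X, y = make_regression(n_samples=200, n_features=5, random_state=0)
    model = GradientBoostingRegressor(n_estimators=50, random_state=0).fit(X, y)

    def tree_importances(est):
        # For one regression tree: for every internal node, add the weighted
        # impurity decrease w[node]*imp[node] - w[left]*imp[left] - w[right]*imp[right]
        # to the feature that node splits on, then divide by the root weight.
        t = est.tree_
        left, right = t.children_left, t.children_right
        w, imp = t.weighted_n_node_samples, t.impurity
        out = np.zeros(t.n_features)
        for node in range(t.node_count):
            if left[node] == -1:  # leaf, no split here
                continue
            out[t.feature[node]] += (w[node] * imp[node]
                                     - w[left[node]] * imp[left[node]]
                                     - w[right[node]] * imp[right[node]])
        return out / w[0]

    # Average over all trees in the ensemble, then normalize to sum to 1.
    per_tree = [tree_importances(est) for est in model.estimators_.ravel()]
    manual = np.mean(per_tree, axis=0)
    manual /= manual.sum()

    print(manual)
    print(model.feature_importances_)  # should be close, up to normalization details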

I have also seen a post by Matthew Drury on Stack Exchange: https://stats.stackexchange.com/questions/162162/relative-variable-importance-for-boosting

Many thanks,
Olga


