Hi,

I would like to understand how feature importances are calculated in gradient boosting regression.

I know that these are the relevant functions:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/gradient_boosting.py#L1165
https://github.com/scikit-learn/scikit-learn/blob/fc2f24927fc37d7e42917369f17de045b14c59b5/sklearn/tree/_tree.pyx#L1056
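For reference, this is my current reading of what the per-tree computation does, written out in plain Python against a fitted tree's arrays (just a sketch of my understanding, not the actual Cython; for the ensemble I believe these per-tree importances are simply averaged over all trees):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y).tree_

importances = np.zeros(X.shape[1])
for node in range(tree.node_count):
    left, right = tree.children_left[node], tree.children_right[node]
    if left == -1:  # leaf node: no split, no contribution
        continue
    # weighted impurity decrease of the split at this node
    decrease = (tree.weighted_n_node_samples[node] * tree.impurity[node]
                - tree.weighted_n_node_samples[left] * tree.impurity[left]
                - tree.weighted_n_node_samples[right] * tree.impurity[right])
    importances[tree.feature[node]] += decrease

importances /= importances.sum()
print(importances)  # as far as I can tell this matches the tree-level feature_importances_

Is that the right mental model?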

From the literature and elsewhere I understand that a Gini-style impurity decrease is accumulated per feature. What exactly is this quantity, and how does it relate to the 'gain' vs 'frequency' importance types implemented in XGBoost?
http://xgboost.readthedocs.io/en/latest/R-package/discoverYourData.html
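In case it clarifies what I mean by 'gain' vs 'frequency', this is roughly how I have been pulling the two numbers out of the XGBoost Python package ('weight' there seems to be what the R documentation calls frequency; again just a sketch):

import xgboost as xgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=5, random_state=0)
booster = xgb.train({"max_depth": 3, "eta": 0.1},
                    xgb.DMatrix(X, label=y), num_boost_round=50)

print(booster.get_score(importance_type="gain"))    # total loss reduction per feature
print(booster.get_score(importance_type="weight"))  # number of splits per feature ('frequency')

My guess is that sklearn's feature_importances_ corresponds to something like 'gain' rather than 'frequency', but I would like to confirm that.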

My problem is that when I fit exactly the same model in sklearn and in the gbm R package, I get different variable importance plots. One of the variables, which was generated completely at random (keeping all the other variables real), comes out as very important in sklearn and very unimportant in gbm. How can a completely random variable get the highest importance?
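For what it's worth, here is a simplified, self-contained version of my sklearn setup (not my actual data): informative features with one purely random column appended, then the impurity-based importances of the fitted model:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
X, y = make_regression(n_samples=500, n_features=4, n_informative=4, random_state=0)
X = np.hstack([X, rng.normal(size=(X.shape[0], 1))])  # last column is pure noise

model = GradientBoostingRegressor(n_estimators=100, max_depth=3,
                                  random_state=0).fit(X, y)
print(model.feature_importances_)  # on my real data the noise column comes out on top

Is there a known reason (for example, many candidate split points on a continuous noise column) why the sklearn importance would inflate it while gbm does not?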


Many thanks,
Olga
