I believe your random variable may, by chance, have some predictive power. In R, use the Information package and check the information value (IV) of that randomly created variable. If it is > 0.05 then it has good predictive power.

On Tue, Apr 18, 2017 at 7:47 AM Olga Lyashevska <o.lyashevsk...@gmail.com> wrote:
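As a rough illustration of the idea (this is not the R Information package itself, just a hand-rolled information-value sketch in Python, and it assumes a binary target, quantile bins, and a small epsilon guard purely for the example):

import numpy as np
import pandas as pd

def information_value(feature, target, bins=10):
    # Crude IV sketch for one numeric feature against a binary target.
    # Bin count and the epsilon guard are arbitrary choices here.
    df = pd.DataFrame({"x": feature, "y": target})
    df["bin"] = pd.qcut(df["x"], q=bins, duplicates="drop")
    grouped = df.groupby("bin", observed=True)["y"]
    events = grouped.sum()                 # positives per bin
    non_events = grouped.count() - events  # negatives per bin
    eps = 1e-6                             # avoid log(0) / division by zero
    pct_event = (events + eps) / (events.sum() + eps)
    pct_non_event = (non_events + eps) / (non_events.sum() + eps)
    woe = np.log(pct_event / pct_non_event)  # weight of evidence per bin
    return float(((pct_event - pct_non_event) * woe).sum())

# A purely random feature against a random binary target: its IV should
# usually be small, but by chance it can creep above a chosen cutoff.
rng = np.random.default_rng(0)
x_random = rng.normal(size=1000)
y = rng.integers(0, 2, size=1000)
print(information_value(x_random, y))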
> Hi,
>
> I would like to understand how feature importances are calculated in
> gradient boosting regression.
>
> I know that these are the relevant functions:
>
> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/gradient_boosting.py#L1165
> https://github.com/scikit-learn/scikit-learn/blob/fc2f24927fc37d7e42917369f17de045b14c59b5/sklearn/tree/_tree.pyx#L1056
>
> From the literature and elsewhere I understand that Gini impurity is
> calculated. What is this exactly, and how does it relate to 'gain' vs
> 'frequency' as implemented in XGBoost?
> http://xgboost.readthedocs.io/en/latest/R-package/discoverYourData.html
>
> My problem is that when I fit exactly the same model in sklearn and gbm
> (the R package) I get different variable importance plots. One of the
> variables, which was generated randomly (keeping all other variables
> real), appears to be very important in sklearn and very unimportant in
> gbm. How is it possible that a completely random variable gets the
> highest importance?
>
> Many thanks,
> Olga
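For concreteness, a minimal Python sketch of the kind of setup described above: a GradientBoostingRegressor fit on synthetic data with one purely random column appended, comparing the impurity-based feature_importances_ against a crude permutation check on held-out data. The data, column layout, and model settings here are assumptions for illustration, not the original experiment.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in: two informative columns plus one purely random
# column appended at index 2.
rng = np.random.default_rng(42)
n = 2000
X_real = rng.normal(size=(n, 2))
y = 3 * X_real[:, 0] - 2 * X_real[:, 1] + rng.normal(scale=0.5, size=n)
X = np.column_stack([X_real, rng.normal(size=n)])  # add random feature

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)

# Impurity-based importances (what feature_importances_ reports):
print("impurity-based:", model.feature_importances_)

# Crude permutation check on held-out data: shuffle one column at a
# time and see how much the test R^2 drops. A truly uninformative
# column should barely move the score.
baseline = model.score(X_te, y_te)
for j in range(X.shape[1]):
    X_perm = X_te.copy()
    X_perm[:, j] = rng.permutation(X_perm[:, j])
    print(f"feature {j}: R^2 drop = {baseline - model.score(X_perm, y_te):.4f}")

A permutation-style check like this is one way to see whether a column that looks important in the impurity-based ranking actually contributes to held-out predictions.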