Dear Vincent,

On 6 February 2014 17:46, Vincent Arel <vincent.a...@gmail.com> wrote:
> Hi all,
>
> Gilles Louppe[1] suggests that feature importance in random forest
> classifiers is calculated using the algorithm of Breiman (1984). I
> imagine this is the same as formula 10.42 on page 368 of Hastie et
> al.[2]. This formula only has a sum, a squared term and an indicator,
> so I'm trying to figure out why I get negative elements in the
> feature_importances_ array when I use sample weights.
Sorry, but my answer on Stack Overflow was a bit misleading on this topic.
Breiman's 1984 book does not discuss variable importances in random forests
or in boosting, since those algorithms were formulated 10 to 20 years later.
The only definition of variable importance in that book is Definition 5.9,
which defines importance in terms of surrogate splits and is very different
from formula 10.42. (In this regard, the citation in Hastie et al. for 10.42
is wrong.)

In scikit-learn, variable importances are defined as in Breiman's papers on
random forests (from 2001 and 2002): the importance of a variable is the sum
of the impurity decreases over all nodes where that variable is used for
splitting, averaged over all trees in the forest (see Equation (2) from [1]).
They should therefore all be positive. (In addition, the importances are
normalised so that they sum to 1.0.) A minimal sketch of this computation is
included at the end of this message.

[1]: http://papers.nips.cc/paper/4928-understanding-variable-importances-in-forests-of-randomized-trees.pdf

> My dataset has 2 labels that are highly unbalanced (roughly 1% of 1s to
> 99% of 0s) and it's too large for in-memory processing on my laptop, so I
> drew a balanced subsample and would like to use sample_weight to adjust
> accordingly. Based on prior knowledge, I expect that some of the features
> with large negative importance values are in fact important.
>
> I link to example data below[3] (44 MB) to use with the code I paste below.
>
> Any thoughts? Help would be greatly appreciated!
>
> Vincent
>
>
> import pickle
> from sklearn.ensemble import RandomForestClassifier
>
> f = open('diagnostic.pickle', 'rb')  # binary mode for pickled data
> dat = pickle.load(f)
> f.close()
>
> clf = RandomForestClassifier()
> clf.fit(dat['X'], dat['y'], sample_weight=dat['w'])
> clf.feature_importances_

I can confirm the bug. Feature importances are all positive when the sample
weights are not used, but some become negative when fitting with dat['w'].
I am looking into it.

> [1]: http://stackoverflow.com/questions/15810339/how-are-feature-importances-in-randomforestclassifier-determined
> [2]: http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf
> [3]: http://umich.edu/~varel/diagnostic.pickle
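For reference, here is a minimal sketch (not the library's internal code) of
the definition above: it recomputes mean-decrease-in-impurity importances from
the public tree_ arrays of a fitted forest, on a small synthetic dataset. The
dataset and parameter values are illustrative assumptions; up to floating-point
differences the result should match clf.feature_importances_.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

importances = np.zeros(X.shape[1])
for est in clf.estimators_:
    tree = est.tree_
    tree_imp = np.zeros(X.shape[1])
    for node in range(tree.node_count):
        left = tree.children_left[node]
        right = tree.children_right[node]
        if left == -1:  # leaf node: no split, no impurity decrease
            continue
        # weighted impurity decrease contributed by this split
        n = tree.weighted_n_node_samples[node]
        n_l = tree.weighted_n_node_samples[left]
        n_r = tree.weighted_n_node_samples[right]
        decrease = (n * tree.impurity[node]
                    - n_l * tree.impurity[left]
                    - n_r * tree.impurity[right])
        tree_imp[tree.feature[node]] += decrease
    if tree_imp.sum() <= 0:  # degenerate tree (single leaf), skip it
        continue
    importances += tree_imp / tree_imp.sum()  # per-tree normalisation
importances /= importances.sum()  # normalise so the importances sum to 1.0

print(np.allclose(importances, clf.feature_importances_))

With unweighted data every impurity decrease is non-negative, which is why the
negative values you see when passing sample_weight look like a bug rather than
an expected outcome.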