Vincent, I identified the bug and opened an issue at https://github.com/scikit-learn/scikit-learn/issues/2835
I will try to fix this in the next few days. Sorry for the inconvenience.

Gilles

On 6 February 2014 18:18, Gilles Louppe <g.lou...@gmail.com> wrote:
> Dear Vincent,
>
> On 6 February 2014 17:46, Vincent Arel <vincent.a...@gmail.com> wrote:
>> Hi all,
>>
>> Gilles Louppe[1] suggests that feature importances in random forest
>> classifiers are calculated using the algorithm of Breiman (1984). I
>> imagine this is the same as formula 10.42 on page 368 of Hastie et
>> al.[2]. That formula only has a sum, a squared term and an indicator,
>> so I'm trying to figure out why I get negative elements in the
>> feature_importances_ array when I use sample weights.
>
> Sorry, but my answer on Stack Overflow was a bit misleading on this
> topic. Breiman's book from 1984 does not discuss variable importances
> in random forests and/or in boosting (those algorithms were formulated
> 10 to 20 years later). The only definition of variable importances in
> Breiman's book is Definition 5.9, which defines importance in terms of
> surrogate splits and is very different from formula 10.42. (In this
> regard, the citation in Hastie regarding 10.42 is wrong.)
>
> In scikit-learn, variable importances are defined as in Breiman's
> papers on random forests (from 2001 and 2002): the importance of a
> variable is the sum of the impurity decreases over all nodes where that
> variable is used for splitting, averaged over all trees in the forest
> (see Equation (2) from [1]). They should therefore all be positive.
> (In addition, the importances are normalised so that they sum to 1.0.)
>
> [1]: http://papers.nips.cc/paper/4928-understanding-variable-importances-in-forests-of-randomized-trees.pdf
>
>> My dataset has 2 labels that are highly unbalanced (roughly 1% 1s to
>> 99% 0s) and it is too large for in-memory processing on my laptop, so
>> I drew a balanced subsample and would like to use sample_weight to
>> adjust accordingly. Based on prior knowledge, I expect that some of
>> the features with large negative importance values are in fact
>> important.
>>
>> I link to example data below[3] (44Mb) to use with the code I paste
>> below.
>>
>> Any thoughts? Help would be greatly appreciated!
>>
>> Vincent
>>
>>
>> import pickle
>> from sklearn.ensemble import RandomForestClassifier
>>
>> # Pickle files should be opened in binary mode.
>> with open('diagnostic.pickle', 'rb') as f:
>>     dat = pickle.load(f)
>>
>> clf = RandomForestClassifier()
>> clf.fit(dat['X'], dat['y'], sample_weight=dat['w'])
>> clf.feature_importances_
>>
>
> I can confirm the bug. Feature importances are all positive when not
> using the sample weights, but become negative when using dat['w']... I
> am looking into it.
>
>> [1]: http://stackoverflow.com/questions/15810339/how-are-feature-importances-in-randomforestclassifier-determined
>> [2]: http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf
>> [3]: http://umich.edu/~varel/diagnostic.pickle
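For reference, here is a rough sketch of the impurity-decrease definition
Gilles describes above, computed from the public tree_ arrays of a fitted
forest. The helper name mdi_importances is made up for illustration, and
this is only a sketch of Equation (2) from [1], not the library's exact
implementation:

import numpy as np

def mdi_importances(forest, n_features):
    importances = np.zeros(n_features)
    for estimator in forest.estimators_:
        tree = estimator.tree_
        imp = np.zeros(n_features)
        for node in range(tree.node_count):
            left = tree.children_left[node]
            right = tree.children_right[node]
            if left == -1:  # leaf node: no split, hence no impurity decrease
                continue
            # Weighted impurity decrease contributed by the split at this node.
            decrease = (tree.weighted_n_node_samples[node] * tree.impurity[node]
                        - tree.weighted_n_node_samples[left] * tree.impurity[left]
                        - tree.weighted_n_node_samples[right] * tree.impurity[right])
            imp[tree.feature[node]] += decrease
        imp /= tree.weighted_n_node_samples[0]  # scale by total weighted samples
        if imp.sum() > 0:
            imp /= imp.sum()  # per-tree importances sum to 1.0
        importances += imp
    return importances / len(forest.estimators_)  # average over trees

On an unweighted fit, mdi_importances(clf, dat['X'].shape[1]) (assuming
dat['X'] is a 2-D array) should roughly agree with clf.feature_importances_;
all entries are non-negative by construction, which is why negative values
point to a bug.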
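As for reweighting a class-balanced subsample, one common approach is to
weight each retained sample by the inverse of its class's sampling rate, so
that the weighted subsample mimics the class balance of the full data. A
minimal sketch, assuming the class counts of the full data are known (the
helper name subsample_weights is made up for illustration):

import numpy as np

def subsample_weights(y_sub, full_class_counts):
    # full_class_counts: mapping from class label to its count in the full data.
    # Each retained sample stands in for (full count / subsample count) original
    # samples of its class.
    sub_counts = {c: np.sum(y_sub == c) for c in full_class_counts}
    return np.array([full_class_counts[c] / float(sub_counts[c]) for c in y_sub])

The resulting array can then be passed to fit() via the sample_weight
argument, as in the snippet above.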