Hi all,

Gilles Louppe[1] suggests that feature importance in random forest classifiers is calculated using the algorithm of Breiman (1984). I imagine this is the same as formula 10.42 on page 368 of Hastie et al.[2]. That formula contains only a sum, a squared term and an indicator, so I'm trying to figure out why I get negative elements in the feature_importances_ array when I use sample weights.
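For reference, here is the formula as I read it (my transcription of ESL eq. 10.42, so please check it against the book). For a single tree T with J - 1 internal nodes:

    \mathcal{I}_\ell^2(T) = \sum_{t=1}^{J-1} \hat{\imath}_t^2 \, I\big(v(t) = \ell\big)

where \hat{\imath}_t^2 is the improvement in the split criterion at internal node t and v(t) is the variable split on at that node; the ensemble importance averages this over trees (eq. 10.43). Every term is a squared quantity times an indicator, hence my confusion about the negative values.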
My dataset has two labels that are highly unbalanced (roughly 1% ones to 99% zeros) and it is too large for in-memory processing on my laptop, so I drew a balanced subsample and would like to use sample weights to adjust for that. Based on prior knowledge, I expect that some of the features with large negative importance values are in fact important. I link to example data below[3] (44 MB) for use with the code pasted below. Any thoughts? Help would be greatly appreciated!

Vincent

import pickle
from sklearn.ensemble import RandomForestClassifier

# Load the example data: a dict with feature matrix 'X', labels 'y',
# and per-sample weights 'w'.
with open('diagnostic.pickle', 'rb') as f:  # pickles should be opened in binary mode
    dat = pickle.load(f)

clf = RandomForestClassifier()
clf.fit(dat['X'], dat['y'], sample_weight=dat['w'])
print(clf.feature_importances_)  # contains negative entries

[1]: http://stackoverflow.com/questions/15810339/how-are-feature-importances-in-randomforestclassifier-determined
[2]: http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf
[3]: http://umich.edu/~varel/diagnostic.pickle
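P.S. For concreteness, here is roughly how I construct the weights (a simplified sketch; the 0.99/0.01 frequencies below stand in for the ones measured on the full dataset, which I can't load in memory):

import pickle
import numpy as np

# Reload the example data (same dict as in the snippet above).
with open('diagnostic.pickle', 'rb') as f:
    dat = pickle.load(f)

y = dat['y']

# Illustrative class frequencies in the full dataset.
full_freq = {0: 0.99, 1: 0.01}

# Class frequencies in the balanced subsample (~0.5 each by construction).
sub_freq = {c: (y == c).mean() for c in (0, 1)}

# Weight each sampled point by (full frequency / subsample frequency),
# so the weighted subsample mimics the original class mix.
w = np.array([full_freq[c] / sub_freq[c] for c in y])

Note that all of these weights are strictly positive, so I don't see how the weights themselves could push an importance below zero.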