Hi all,

Gilles Louppe[1] suggests that feature importance in random forest
classifiers is calculated using the algorithm of Breiman (1984). I
imagine this is the same as formula 10.42 on page 368 of Hastie et
al.[2]. This formula has only a sum, a squared term, and an indicator,
so I'm trying to figure out why I get negative elements in the
feature_importances_ array when I use sample weights.
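
For reference, my reading of 10.42 (assuming I've transcribed it
correctly) is:

    \hat{I}_\ell^2(T) = \sum_{t=1}^{J-1} \hat{i}_t^2 \, I(v(t) = \ell)

where \hat{i}_t^2 is the (squared) improvement in the split criterion
at internal node t of tree T, and v(t) is the variable split on at
node t. Every term is nonnegative, so both the per-tree importance
and its average over trees should be >= 0.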

My dataset has 2 labels that are highly unbalanced (roughly 1% 1s to
99% 0s) and it's too large for in-memory processing on my laptop, so I
drew a balanced subsample and would like to use the sample_weight
parameter to adjust accordingly. Based on prior knowledge, I expect
that some of the features with large negative importance values are in
fact important.
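
For context, the weights were constructed roughly along these lines (a
sketch, not my exact code; orig_prev, sub_prev, and the toy labels are
illustrative):

import numpy as np

y = np.array([0, 0, 1, 0, 1])   # toy labels standing in for dat['y']
# Reweight the balanced subsample back toward the original prevalence:
# each sample gets (original prevalence) / (subsample prevalence)
# for its class.
orig_prev = {0: 0.99, 1: 0.01}  # prevalence in the full dataset
sub_prev = {0: 0.50, 1: 0.50}   # prevalence in the balanced subsample
w = np.array([orig_prev[label] / sub_prev[label] for label in y])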

Example data (44 MB) is linked below[3], along with code that
reproduces the issue.

Any thoughts? Help would be greatly appreciated!

Vincent


import pickle
from sklearn.ensemble import RandomForestClassifier

# Load the example data: dict with 'X', 'y', and sample weights 'w'.
# Pickle files must be opened in binary mode.
with open('diagnostic.pickle', 'rb') as f:
    dat = pickle.load(f)

clf = RandomForestClassifier()
clf.fit(dat['X'], dat['y'], sample_weight=dat['w'])
print(clf.feature_importances_)  # contains negative entries
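
For what it's worth, one check I can run (a sketch along the same
lines) is to refit without the weights and compare the minimum
importance, to confirm the negatives only appear when sample_weight
is passed:

# Refit without weights; importances should all be >= 0 here.
clf_unweighted = RandomForestClassifier()
clf_unweighted.fit(dat['X'], dat['y'])
print(clf_unweighted.feature_importances_.min())
print(clf.feature_importances_.min())  # negative when weights are used?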


[1]: 
http://stackoverflow.com/questions/15810339/how-are-feature-importances-in-randomforestclassifier-determined
[2]: http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf
[3]: http://umich.edu/~varel/diagnostic.pickle
