Dear Vincent,

On 6 February 2014 17:46, Vincent Arel <vincent.a...@gmail.com> wrote:
> Hi all,
>
> Gilles Louppe[1] suggests that feature importance in random forest
> classifiers is calculated using the algorithm of Breiman (1984). I
> imagine this is the same as formula 10.42 on page 368 of Hastie et
> al.[2]. This formula only has a sum, a squared term and an indicator,
> so I'm trying to figure out why I get negative elements in the
> feature_importances_ array when I use sample weights.
Sorry, but my answer on Stack Overflow was a bit misleading on this topic.
Breiman's 1984 book does not discuss variable importances in random forests
or in boosting, since those algorithms were formulated 10 to 20 years later.
The only definition of variable importance in that book is Definition 5.9,
which defines importance in terms of surrogate splits and is very different
from formula 10.42. (In this regard, the citation in Hastie et al. for 10.42
is wrong.)

In scikit-learn, variable importances are defined as in Breiman's papers on
random forests (from 2001 and 2002): the importance of a variable is the sum
of the impurity decreases over all nodes where that variable is used for
splitting, averaged over all trees in the forest (see Equation (2) from [1]).
They should therefore all be positive. (In addition, the importances are
normalised so that they sum to 1.0.) A minimal sketch of this computation is
included at the end of this message.

[1]: http://papers.nips.cc/paper/4928-understanding-variable-importances-in-forests-of-randomized-trees.pdf

> My dataset has 2 labels that are highly unbalanced (roughly 1% of 1s to
> 99% of 0s) and it's too large for in-memory processing on my laptop, so I
> drew a balanced subsample and would like to use sample_weight to adjust
> accordingly. Based on prior knowledge, I expect that some of the features
> with large negative importance values are in fact important.
>
> I link to example data below[3] (44 MB) to use with the code I paste below.
>
> Any thoughts? Help would be greatly appreciated!
>
> Vincent
>
>
> import pickle
> from sklearn.ensemble import RandomForestClassifier
>
> f = open('diagnostic.pickle', 'rb')  # binary mode for pickled data
> dat = pickle.load(f)
> f.close()
>
> clf = RandomForestClassifier()
> clf.fit(dat['X'], dat['y'], sample_weight=dat['w'])
> clf.feature_importances_

I can confirm the bug. Feature importances are all positive when the sample
weights are not used, but some become negative when fitting with dat['w'].
I am looking into it.

> [1]: http://stackoverflow.com/questions/15810339/how-are-feature-importances-in-randomforestclassifier-determined
> [2]: http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf
> [3]: http://umich.edu/~varel/diagnostic.pickle
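For reference, here is a minimal sketch (not the library's internal code) of
the definition above: it recomputes mean-decrease-in-impurity importances from
the public tree_ arrays of a fitted forest, on a small synthetic dataset. The
dataset and parameter values are illustrative assumptions; up to floating-point
differences the result should match clf.feature_importances_.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

importances = np.zeros(X.shape[1])
for est in clf.estimators_:
    tree = est.tree_
    tree_imp = np.zeros(X.shape[1])
    for node in range(tree.node_count):
        left = tree.children_left[node]
        right = tree.children_right[node]
        if left == -1:  # leaf node: no split, no impurity decrease
            continue
        # weighted impurity decrease contributed by this split
        n = tree.weighted_n_node_samples[node]
        n_l = tree.weighted_n_node_samples[left]
        n_r = tree.weighted_n_node_samples[right]
        decrease = (n * tree.impurity[node]
                    - n_l * tree.impurity[left]
                    - n_r * tree.impurity[right])
        tree_imp[tree.feature[node]] += decrease
    if tree_imp.sum() <= 0:  # degenerate tree (single leaf), skip it
        continue
    importances += tree_imp / tree_imp.sum()  # per-tree normalisation
importances /= importances.sum()  # normalise so the importances sum to 1.0

print(np.allclose(importances, clf.feature_importances_))

With unweighted data every impurity decrease is non-negative, which is why the
negative values you see when passing sample_weight look like a bug rather than
an expected outcome.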