Vincent,

I identified the bug and opened an issue at
https://github.com/scikit-learn/scikit-learn/issues/2835

I will try to fix this in the next few days.

Sorry for the inconvenience.

Gilles

On 6 February 2014 18:18, Gilles Louppe <g.lou...@gmail.com> wrote:
> Dear Vincent,
>
> On 6 February 2014 17:46, Vincent Arel <vincent.a...@gmail.com> wrote:
>> Hi all,
>>
>> Gilles Louppe[1] suggests that feature importance in random forest
>> classifiers is calculated using the algorithm of Breiman (1984). I
>> imagine this is the same as formula 10.42 on page 368 of Hastie et
>> al.[2]. This formula only has a sum, a squared term and an indicator,
>> so I'm trying to figure out why I get negative elements in the
>> feature_importances_ array when I use sample weights.
>
> Sorry, but my answer on Stack Overflow was a bit misleading on this
> topic. Breiman's 1984 book does not discuss variable importances in
> random forests or boosting (those algorithms were formulated 10 to 20
> years later). The only definition of variable importances in Breiman's
> book is Definition 5.9, which defines importance in terms of surrogate
> splits and is very different from formula 10.42. (In this regard, the
> citation in Hastie et al. for 10.42 is wrong.)
>
> In scikit-learn, variable importances are defined as in Breiman's
> papers on random forests (from 2001 and 2002): the importance of a
> variable is the sum of the impurity decreases over all nodes where
> that variable is used for splitting, averaged over all trees in the
> forest (see Equation (2) from [1]). They should therefore all be
> positive. (In addition, variable importances are normalised so that
> they sum to 1.0.)
>
> [1]: 
> http://papers.nips.cc/paper/4928-understanding-variable-importances-in-forests-of-randomized-trees.pdf
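>
> As a rough illustration (only a sketch, not the exact code used in the
> tree module), the same quantity can be recomputed from a fitted forest
> through the tree_ attribute of each estimator:
>
> import numpy as np
> from sklearn.datasets import make_classification
> from sklearn.ensemble import RandomForestClassifier
>
> def tree_importances(estimator, n_features):
>     # accumulate the weighted impurity decrease of every internal
>     # node, attributed to the feature it splits on
>     t = estimator.tree_
>     imp = np.zeros(n_features)
>     total = t.weighted_n_node_samples[0]
>     for node in range(t.node_count):
>         left, right = t.children_left[node], t.children_right[node]
>         if left == -1:  # leaf, no split
>             continue
>         decrease = (t.weighted_n_node_samples[node] * t.impurity[node]
>                     - t.weighted_n_node_samples[left] * t.impurity[left]
>                     - t.weighted_n_node_samples[right] * t.impurity[right])
>         imp[t.feature[node]] += decrease / total
>     return imp / imp.sum()  # normalise within each tree
>
> X, y = make_classification(n_samples=500, n_features=5, random_state=0)
> forest = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)
> manual = np.mean([tree_importances(e, X.shape[1])
>                   for e in forest.estimators_], axis=0)
> print(manual)  # should be close to forest.feature_importances_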
>
>> My dataset has 2 labels that are highly unbalanced (about 1% 1s to
>> 99% 0s) and it's too large for in-memory processing on my laptop, so I
>> drew a balanced subsample and would like to use sample_weight to
>> adjust accordingly. Based on prior knowledge, I expect that some of
>> the features with large negative importance values are in fact
>> important.
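>>
>> As a minimal sketch of what I mean (hypothetical data, not my actual
>> preprocessing): each row kept in the balanced subsample is weighted by
>> the inverse of its sampling probability, so positives keep weight 1
>> and the kept negatives are up-weighted:
>>
>> import numpy as np
>>
>> rng = np.random.RandomState(0)
>> y_full = (rng.rand(100000) < 0.01).astype(int)  # ~1% positives
>>
>> pos = np.flatnonzero(y_full == 1)
>> neg = rng.choice(np.flatnonzero(y_full == 0), size=pos.size,
>>                  replace=False)
>> idx = np.concatenate([pos, neg])  # class-balanced subsample
>>
>> # negatives were kept with probability pos.size / n_negatives
>> p_neg = pos.size / float((y_full == 0).sum())
>> w = np.where(y_full[idx] == 1, 1.0, 1.0 / p_neg)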
>>
>> I link to example data below[3] (44Mb) to use with some code I paste below.
>>
>> Any thoughts? Help would be greatly appreciated!
>>
>> Vincent
>>
>>
>> import pickle
>> from sklearn.ensemble import RandomForestClassifier
>>
>> # pickle files need to be opened in binary mode
>> with open('diagnostic.pickle', 'rb') as f:
>>     dat = pickle.load(f)
>>
>> clf = RandomForestClassifier()
>> clf.fit(dat['X'], dat['y'], sample_weight=dat['w'])
>> print(clf.feature_importances_)
>>
>
> I can confirm the bug. Feature importances are all positive when not
> using the sample weights but become negative with dat['w']... I am
> looking into it.
>
>>
>> [1]: 
>> http://stackoverflow.com/questions/15810339/how-are-feature-importances-in-randomforestclassifier-determined
>> [2]: 
>> http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf
>> [3]: http://umich.edu/~varel/diagnostic.pickle
>>

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
