Not sure how it compares in practice, but it's certainly more efficient to rank 
the features by impurity decrease rather than by OOB permutation performance 
(a rough sketch of the latter is below), since you wouldn't need to 
a) compute the OOB performance (an extra inference pass),
b) permute a feature column, do another inference pass, and compare it to a),
c) repeat step b) for each feature column.
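
Roughly, those three steps would look something like this (just a minimal 
sketch using a held-out array instead of the OOB samples; the model and 
variable names are placeholders, not anything from scikit-learn itself):

    import numpy as np
    from sklearn.metrics import accuracy_score

    def permutation_importance(model, X_valid, y_valid, random_state=0):
        rng = np.random.RandomState(random_state)
        # a) baseline performance on the untouched validation data
        baseline = accuracy_score(y_valid, model.predict(X_valid))
        importances = np.zeros(X_valid.shape[1])
        for j in range(X_valid.shape[1]):
            # b) permute feature column j and do another inference pass
            X_perm = X_valid.copy()
            rng.shuffle(X_perm[:, j])
            permuted = accuracy_score(y_valid, model.predict(X_perm))
            # importance = drop in performance caused by permuting column j
            importances[j] = baseline - permuted
        # c) step b) was repeated for every feature column above
        return importances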

Another reason is that Breiman's suggestion wouldn't work for certain 
RandomForestClassifier settings in scikit-learn, e.g., bootstrap=False, 
where no out-of-bag samples exist in the first place.

If you would like to compute the feature importance following Breiman's 
suggestion, I have implemented a simple wrapper function for scikit-learn 
estimators here:

http://rasbt.github.io/mlxtend/user_guide/evaluate/feature_importance_permutation/#example-1-feature-importance-for-classifiers

Note, though, that it uses an independent validation set rather than OOB 
samples, because it's a general function that shouldn't be restricted to 
random forests. If you have such an independent dataset, it should give more 
accurate results than using OOB samples.
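
For example, the usage would look roughly like this (a sketch from memory, so 
please double-check against the documentation linked above; the toy dataset 
and parameter values are only illustrative):

    from mlxtend.evaluate import feature_importance_permutation
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # illustrative toy data with an independent test split
    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    forest = RandomForestClassifier(n_estimators=100, random_state=0)
    forest.fit(X_train, y_train)

    # permutation importance evaluated on the independent test set
    imp_means, imp_all = feature_importance_permutation(
        predict_method=forest.predict,
        X=X_test,
        y=y_test,
        metric='accuracy',
        num_rounds=10,
        seed=1)
    print(imp_means)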

Best,
Sebastian

> On May 4, 2018, at 7:10 PM, Niyaghi, Faraz <niyag...@oregonstate.edu> wrote:
> 
> Greetings,
> 
> This is Faraz Niyaghi from Oregon State University. I do research on variable 
> selection using random forests. To the best of my knowledge, there is a 
> difference between scikit-learn's and Breiman's definition of feature 
> importance. Breiman uses out-of-bag (OOB) cases to calculate feature 
> importance, but scikit-learn doesn't. I was wondering: 1) why are they 
> different? 2) can they result in very different rankings of features?
> 
> Here are the definitions I found on the web:
> 
> Breiman: "In every tree grown in the forest, put down the oob cases and count 
> the number of votes cast for the correct class. Now randomly permute the 
> values of variable m in the oob cases and put these cases down the tree. 
> Subtract the number of votes for the correct class in the variable-m-permuted 
> oob data from the number of votes for the correct class in the untouched oob 
> data. The average of this number over all trees in the forest is the raw 
> importance score for variable m."
> Link: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
> 
> scikit-learn: " The relative rank (i.e. depth) of a feature used as a 
> decision node in a tree can be used to assess the relative importance of that 
> feature with respect to the predictability of the target variable. Features 
> used at the top of the tree contribute to the final prediction decision of a 
> larger fraction of the input samples. The expected fraction of the samples 
> they contribute to can thus be used as an estimate of the relative importance 
> of the features."
> Link: http://scikit-learn.org/stable/modules/ensemble.html
> 
> Thank you for reading this email. Please let me know your thoughts.
> 
> Cheers,
> Faraz.
> 
> Faraz Niyaghi
> 
> Ph.D. Candidate, Department of Statistics
> Oregon State University
> Corvallis, OR
