Greetings,

This is Faraz Niyaghi from Oregon State University. I research variable
selection using random forests. To the best of my knowledge, scikit-learn's
definition of feature importance differs from Breiman's: Breiman uses
out-of-bag (OOB) cases to compute feature importance, but scikit-learn does
not. I was wondering: 1) why are they different? 2) can they result in very
different rankings of the features?

Here are the definitions I found on the web:

*Breiman:* "In every tree grown in the forest, put down the oob cases and
count the number of votes cast for the correct class. Now randomly permute
the values of variable m in the oob cases and put these cases down the
tree. Subtract the number of votes for the correct class in the
variable-m-permuted oob data from the number of votes for the correct class
in the untouched oob data. The average of this number over all trees in the
forest is the raw importance score for variable m."
Link: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
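
To make sure I am reading that correctly, here is a minimal sketch of the
procedure in Python. It hand-rolls the forest from DecisionTreeClassifier so
that the bootstrap draw, and hence the OOB cases, is explicit; the iris data,
the tree settings, and all names are only my assumptions for illustration:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(0)
n_samples, n_features = X.shape
n_trees = 100
raw_importance = np.zeros(n_features)

for _ in range(n_trees):
    # Grow each tree on a bootstrap sample; cases not drawn are its OOB cases.
    in_bag = rng.randint(0, n_samples, n_samples)
    oob = np.setdiff1d(np.arange(n_samples), in_bag)
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=rng)
    tree.fit(X[in_bag], y[in_bag])

    # Votes for the correct class on the untouched OOB cases.
    votes_intact = np.sum(tree.predict(X[oob]) == y[oob])

    for m in range(n_features):
        # Randomly permute variable m in the OOB cases and re-predict.
        X_perm = X[oob].copy()
        X_perm[:, m] = rng.permutation(X_perm[:, m])
        votes_permuted = np.sum(tree.predict(X_perm) == y[oob])
        raw_importance[m] += votes_intact - votes_permuted

# Average over all trees gives the raw importance score for each variable m.
raw_importance /= n_trees
print(raw_importance)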

*scikit-learn:* "The relative rank (i.e. depth) of a feature used as a
decision node in a tree can be used to assess the relative importance of
that feature with respect to the predictability of the target variable.
Features used at the top of the tree contribute to the final prediction
decision of a larger fraction of the input samples. The expected fraction
of the samples they contribute to can thus be used as an estimate of the
relative importance of the features."
Link: http://scikit-learn.org/stable/modules/ensemble.html
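
For comparison, here is what scikit-learn itself reports. As far as I can
tell, the quoted description corresponds to the feature_importances_
attribute (mean decrease in impurity, weighted by the fraction of samples
reaching each node), which is computed on the training data while the trees
are grown, with no OOB cases involved. The data set is again just a
placeholder:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Built-in importance: accumulated impurity decrease over the training data;
# no out-of-bag cases are used anywhere in this computation.
mdi = forest.feature_importances_
print("importances:", mdi)
print("ranking:    ", np.argsort(mdi)[::-1])

Comparing np.argsort of this output with the ranking from the sketch above
would show whether the two definitions disagree on a given data set.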

Thank you for reading this email. Please let me know your thoughts.

Cheers,
Faraz.

Faraz Niyaghi

Ph.D. Candidate, Department of Statistics
Oregon State University
Corvallis, OR