Faraz, take a look at the discussion of this issue here: 
http://parrt.cs.usfca.edu/doc/rf-importance/index.html

Best,
Jeremiah
=========================================
Jeremiah W. Johnson, Ph.D.
Asst. Professor of Data Science
Program Coordinator, B.S. in Analytics & Data Science
University of New Hampshire
Manchester, NH 03101
https://www.linkedin.com/in/jwjohnson314

From: scikit-learn <scikit-learn-bounces+jeremiah.johnson=unh....@python.org>
 on behalf of "Niyaghi, Faraz" <niyag...@oregonstate.edu>
Reply-To: Scikit-learn mailing list <scikit-learn@python.org>
Date: Friday, May 4, 2018 at 7:10 PM
To: "scikit-learn@python.org" <scikit-learn@python.org>
Subject: [scikit-learn] Breiman vs. scikit-learn definition of Feature 
Importance

Greetings,

This is Faraz Niyaghi from Oregon State University. I do research on variable 
selection using random forests. To the best of my knowledge, there is a 
difference between scikit-learn's and Breiman's definitions of feature 
importance: Breiman uses out-of-bag (OOB) cases to calculate feature 
importance, but scikit-learn does not. I was wondering: 1) why are they 
different? 2) can they result in very different rankings of features?

Here are the definitions I found on the web:

Breiman: "In every tree grown in the forest, put down the oob cases and count 
the number of votes cast for the correct class. Now randomly permute the values 
of variable m in the oob cases and put these cases down the tree. Subtract the 
number of votes for the correct class in the variable-m-permuted oob data from 
the number of votes for the correct class in the untouched oob data. The 
average of this number over all trees in the forest is the raw importance score 
for variable m."
Link: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
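
To make that concrete, here is a minimal sketch of the permutation idea as I 
understand it. One caveat: for simplicity it permutes a held-out test set in 
place of each tree's own OOB cases, so it only approximates Breiman's 
procedure; the dataset and parameters are purely illustrative.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

rng = np.random.default_rng(0)
baseline = rf.score(X_te, y_te)          # accuracy on untouched data
perm_imp = np.empty(X.shape[1])
for m in range(X.shape[1]):
    X_perm = X_te.copy()
    # permuting feature m breaks its association with the target
    X_perm[:, m] = rng.permutation(X_perm[:, m])
    # drop in accuracy after permutation = raw importance of feature m
    perm_imp[m] = baseline - rf.score(X_perm, y_te)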

scikit-learn: " The relative rank (i.e. depth) of a feature used as a decision 
node in a tree can be used to assess the relative importance of that feature 
with respect to the predictability of the target variable. Features used at the 
top of the tree contribute to the final prediction decision of a larger 
fraction of the input samples. The expected fraction of the samples they 
contribute to can thus be used as an estimate of the relative importance of the 
features."
Link: http://scikit-learn.org/stable/modules/ensemble.html
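
By contrast, scikit-learn's feature_importances_ attribute comes entirely from 
the fitted trees, with no permutation and no OOB data. A self-contained sketch 
(same illustrative dataset as above):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# feature_importances_ sums each feature's impurity decrease over all
# splits using it, weighted by the fraction of samples reaching the
# split, and averages the result over the trees in the forest.
mdi = rf.feature_importances_
print(np.argsort(mdi)[::-1][:5])  # indices of the five top-ranked features

If I understand correctly, the impurity-based measure can favor 
high-cardinality or continuous features, since they offer more candidate split 
points, which seems like one way the two rankings could diverge.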

Thank you for reading this email. Please let me know your thoughts.

Cheers,
Faraz.

Faraz Niyaghi

Ph.D. Candidate, Department of Statistics
Oregon State University
Corvallis, OR
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
