+1 on the post pointed out by Jeremiah.
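To make the difference concrete: scikit-learn's feature_importances_ is the mean decrease in impurity (MDI), accumulated on the training data as the trees are grown, while Breiman's score permutes one variable in the out-of-bag cases and measures the resulting drop in accuracy. Because they measure different things, the two rankings can indeed diverge. Here is a minimal sketch of the contrast; the dataset and hyperparameters are arbitrary choices, and for simplicity it computes the permutation score on a single held-out split rather than the per-tree OOB cases Breiman describes:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Any dataset works; load_breast_cancer is just a convenient built-in.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# scikit-learn's definition: mean decrease in impurity, computed from the
# training data while the forest is built.
mdi = rf.feature_importances_

# Breiman-style permutation importance, approximated on a held-out split:
# shuffle one column at a time and record the drop in accuracy.
baseline = rf.score(X_test, y_test)
rng = np.random.RandomState(0)
perm = np.empty(X.shape[1])
for j in range(X.shape[1]):
    X_shuffled = X_test.copy()
    X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])
    perm[j] = baseline - rf.score(X_shuffled, y_test)

# The two orderings need not agree.
print(np.argsort(mdi)[::-1][:5])
print(np.argsort(perm)[::-1][:5])

On many datasets the top few features overlap, but MDI is computed on training data and tends to favor high-cardinality features, so the orderings can differ noticeably; the article linked below works through such cases.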
On 5 May 2018 at 02:08, Johnson, Jeremiah <jeremiah.john...@unh.edu> wrote:

> Faraz, take a look at the discussion of this issue here:
> http://parrt.cs.usfca.edu/doc/rf-importance/index.html
>
> Best,
> Jeremiah
> =========================================
> Jeremiah W. Johnson, Ph.D.
> Asst. Professor of Data Science
> Program Coordinator, B.S. in Analytics & Data Science
> University of New Hampshire
> Manchester, NH 03101
> https://www.linkedin.com/in/jwjohnson314
>
> From: scikit-learn <scikit-learn-bounces+jeremiah.johnson=unh.edu@python.org> on behalf of "Niyaghi, Faraz" <niyag...@oregonstate.edu>
> Reply-To: Scikit-learn mailing list <scikit-learn@python.org>
> Date: Friday, May 4, 2018 at 7:10 PM
> To: "scikit-learn@python.org" <scikit-learn@python.org>
> Subject: [scikit-learn] Breiman vs. scikit-learn definition of Feature Importance
>
> Greetings,
>
> This is Faraz Niyaghi from Oregon State University. I research variable
> selection using random forests. To the best of my knowledge, there is a
> difference between scikit-learn's and Breiman's definitions of feature
> importance: Breiman uses out-of-bag (OOB) cases to calculate feature
> importance, but scikit-learn doesn't. I was wondering: 1) why are they
> different? 2) can they result in very different rankings of features?
>
> Here are the definitions I found on the web:
>
> *Breiman:* "In every tree grown in the forest, put down the oob cases and
> count the number of votes cast for the correct class. Now randomly permute
> the values of variable m in the oob cases and put these cases down the
> tree. Subtract the number of votes for the correct class in the
> variable-m-permuted oob data from the number of votes for the correct class
> in the untouched oob data. The average of this number over all trees in the
> forest is the raw importance score for variable m."
> Link: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
>
> *scikit-learn:* "The relative rank (i.e. depth) of a feature used as a
> decision node in a tree can be used to assess the relative importance of
> that feature with respect to the predictability of the target variable.
> Features used at the top of the tree contribute to the final prediction
> decision of a larger fraction of the input samples. The expected fraction
> of the samples they contribute to can thus be used as an estimate of the
> relative importance of the features."
> Link: http://scikit-learn.org/stable/modules/ensemble.html
>
> Thank you for reading this email. Please let me know your thoughts.
>
> Cheers,
> Faraz.
>
> Faraz Niyaghi
> Ph.D. Candidate, Department of Statistics
> Oregon State University
> Corvallis, OR

--
Guillaume Lemaitre
INRIA Saclay - Parietal team
Center for Data Science Paris-Saclay
https://glemaitre.github.io/
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn