+1 on the post pointed out by Jeremiah.
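To make the difference concrete: scikit-learn's feature_importances_ is the mean decrease in impurity (MDI), accumulated on the training data as the trees are grown, while Breiman's score permutes one variable in the out-of-bag cases and measures the resulting drop in accuracy. Because they measure different things, the two rankings can indeed diverge. Here is a minimal sketch of the contrast; the dataset and hyperparameters are arbitrary choices, and for simplicity it computes the permutation score on a single held-out split rather than the per-tree OOB cases Breiman describes:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Any dataset works; load_breast_cancer is just a convenient built-in.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# scikit-learn's definition: mean decrease in impurity, computed from the
# training data while the forest is built.
mdi = rf.feature_importances_

# Breiman-style permutation importance, approximated on a held-out split:
# shuffle one column at a time and record the drop in accuracy.
baseline = rf.score(X_test, y_test)
rng = np.random.RandomState(0)
perm = np.empty(X.shape[1])
for j in range(X.shape[1]):
    X_shuffled = X_test.copy()
    X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])
    perm[j] = baseline - rf.score(X_shuffled, y_test)

# The two orderings need not agree.
print(np.argsort(mdi)[::-1][:5])
print(np.argsort(perm)[::-1][:5])

On many datasets the top few features overlap, but MDI is computed on training data and tends to favor high-cardinality features, so the orderings can differ noticeably; the article linked below works through such cases.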
On 5 May 2018 at 02:08, Johnson, Jeremiah <jeremiah.john...@unh.edu> wrote:

> Faraz, take a look at the discussion of this issue here:
> http://parrt.cs.usfca.edu/doc/rf-importance/index.html
>
> Best,
> Jeremiah
> =========================================
> Jeremiah W. Johnson, Ph.D.
> Asst. Professor of Data Science
> Program Coordinator, B.S. in Analytics & Data Science
> University of New Hampshire
> Manchester, NH 03101
> https://www.linkedin.com/in/jwjohnson314
>
> From: scikit-learn <scikit-learn-bounces+jeremiah.johnson=unh.edu@python.org> on behalf of "Niyaghi, Faraz" <niyag...@oregonstate.edu>
> Reply-To: Scikit-learn mailing list <scikit-learn@python.org>
> Date: Friday, May 4, 2018 at 7:10 PM
> To: "scikit-learn@python.org" <scikit-learn@python.org>
> Subject: [scikit-learn] Breiman vs. scikit-learn definition of Feature Importance
>
> Greetings,
>
> This is Faraz Niyaghi from Oregon State University. I research variable
> selection using random forests. To the best of my knowledge, there is a
> difference between scikit-learn's and Breiman's definitions of feature
> importance: Breiman uses out-of-bag (OOB) cases to calculate feature
> importance, but scikit-learn doesn't. I was wondering: 1) why are they
> different? 2) can they result in very different rankings of features?
>
> Here are the definitions I found on the web:
>
> *Breiman:* "In every tree grown in the forest, put down the oob cases and
> count the number of votes cast for the correct class. Now randomly permute
> the values of variable m in the oob cases and put these cases down the
> tree. Subtract the number of votes for the correct class in the
> variable-m-permuted oob data from the number of votes for the correct class
> in the untouched oob data. The average of this number over all trees in the
> forest is the raw importance score for variable m."
> Link: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
>
> *scikit-learn:* "The relative rank (i.e. depth) of a feature used as a
> decision node in a tree can be used to assess the relative importance of
> that feature with respect to the predictability of the target variable.
> Features used at the top of the tree contribute to the final prediction
> decision of a larger fraction of the input samples. The expected fraction
> of the samples they contribute to can thus be used as an estimate of the
> relative importance of the features."
> Link: http://scikit-learn.org/stable/modules/ensemble.html
>
> Thank you for reading this email. Please let me know your thoughts.
>
> Cheers,
> Faraz.
>
> Faraz Niyaghi
> Ph.D. Candidate, Department of Statistics
> Oregon State University
> Corvallis, OR

--
Guillaume Lemaitre
INRIA Saclay - Parietal team
Center for Data Science Paris-Saclay
https://glemaitre.github.io/
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn