Dear Gilles, sorry to jump into this discussion, but it caught my interest. In the R randomForest package, MeanDecreaseGini can be calculated.
Does scikit-learn somehow scale MeanDecreaseGini to a percentage scale?
Please find attached the variable importance as computed by scikit-learn's RF and R's RF.

(See attached file: RF_sklearn.png)
(See attached file: RF_R.png)

In the R case, I only had 10 features, but in the sklearn case there were a few more. Of course, one cannot compare the absolute numbers of VariableImportance/MeanDecreaseGini, but I'm astonished to get such large values in the R implementation.

Cheers & Thanks,
Paul

> Hi Olivier,
>
> There are indeed several ways to get feature "importances". As
> often, there is no strict consensus about what this word means.
>
> In our case, we implement the importance as described in [1] (often
> cited, but unfortunately rarely read...). It is sometimes called
> "gini importance" or "mean decrease impurity" and is defined as the
> total decrease in node impurity (weighted by the probability of
> reaching that node (which is approximated by the proportion of
> samples)) averaged over all trees of the ensemble.
>
> The other measure is the one you describe. It is sometimes called
> "mean decrease accuracy". It is more intensive to compute since it
> requires (repeated) random permutations of each feature. It also
> works only with bootstrapping.
>
> Note that both measures are available in the randomForest R package.
>
> [1]: Breiman, Friedman, "Classification and regression trees", 1984.
>
> I'll reply on SO as well.
>
> Hope this helps,
>
> Gilles
>
>
> On 4 April 2013 21:35, Peter Prettenhofer <[email protected]> wrote:
> I posted a brief description of the algorithm. The method that we
> implement is briefly described in ESLII. Gilles is the expert here,
> he can give more details on the issue.
>
> 2013/4/4 Olivier Grisel <[email protected]>
> The variable importance in scikit-learn's implementation of random
> forest is based on the proportion of samples that were classified by
> the feature at some point in one of the decision trees evaluation.
>
> http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation
>
> This method seems different from the OOB based method of Breiman 2001
> (section 10):
>
> http://www.stat.berkeley.edu/~breiman/randomforest2001.pdf
>
> Is there any reference for the method implemented in the scikit?
>
> Here is the original Stack Overflow question:
>
> http://stackoverflow.com/questions/15810339/how-are-feature-importances-in-randomforestclassifier-determined/15811003?noredirect=1#comment22487062_15811003
>
> --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel
>
> --
> Peter Prettenhofer
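
For reference, the "mean decrease impurity" definition quoted in this thread can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not scikit-learn's or R's actual code: the two tiny trees, the `(feature, parent, left, right)` tuple layout, and the function names are all hypothetical. It shows that the raw importances (impurity decreases weighted by the proportion of samples reaching a node, averaged over trees) are small numbers, and that dividing by their sum gives values that add up to 1, which is the scale scikit-learn's `feature_importances_` is reported on. Weighting by node sample counts instead of proportions, as the last line does, gives much larger absolute values with the same ranking.

```python
def gini(counts):
    """Gini impurity of a node, given its per-class sample counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


def tree_importances(splits, n_features):
    """Weighted impurity decrease per feature for one tree.

    `splits` is a hypothetical list of (feature, parent, left, right)
    tuples, one per internal node; parent/left/right are per-class
    counts. The first tuple is assumed to be the root split.
    """
    n_root = sum(splits[0][1])
    importances = [0.0] * n_features
    for feature, parent, left, right in splits:
        n, n_left, n_right = sum(parent), sum(left), sum(right)
        # Impurity decrease of this split, weighted by node sizes.
        decrease = (n * gini(parent)
                    - n_left * gini(left)
                    - n_right * gini(right))
        # Dividing by the root size turns counts into proportions.
        importances[feature] += decrease / n_root
    return importances


# Two made-up trees over 2 features and 100 training samples.
tree_1 = [
    (0, [50, 50], [40, 10], [10, 40]),  # root splits on feature 0
    (1, [40, 10], [40, 0], [0, 10]),    # left child splits on feature 1
]
tree_2 = [
    (0, [50, 50], [30, 20], [20, 30]),  # a single, weak split on feature 0
]
forest = [tree_1, tree_2]
n_features = 2

per_tree = [tree_importances(t, n_features) for t in forest]

# Average over all trees of the ensemble (the definition quoted above).
raw = [sum(col) / len(forest) for col in zip(*per_tree)]

# Rescale so the importances sum to 1.
normalized = [v / sum(raw) for v in raw]

# Same quantity weighted by sample counts rather than proportions.
count_weighted = [v * 100 for v in raw]

print("raw:           ", raw)
print("normalized:    ", normalized)
print("count-weighted:", count_weighted)
```

The sketch averages the raw per-tree values and normalizes once at the end; a real implementation may normalize at a different stage, so only the overall scale, not the exact numbers, should be compared against either package.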
<<attachment: RF_sklearn.png>>
<<attachment: RF_R.png>>
_______________________________________________ Scikit-learn-general mailing list [email protected] https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
