Dear Gilles,

Sorry to jump into this discussion, but it caught my interest.
In the R randomForest package, MeanDecreaseGini can be calculated.


Does scikit-learn somehow scale MeanDecreaseGini to a percentage scale?

Please find attached the variable importances as computed by scikit-learn's
RF and R's RF.
(See attached file: RF_sklearn.png)
(See attached file: RF_R.png)


In the R case, I only had 10 features, but in the sklearn case, there were
a few more.
Of course, one cannot compare the absolute numbers of
VariableImportance/MeanDecreaseGini, but I'm astonished to get such large
values in the R implementation.
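
For reference, a quick check of the scale on the scikit-learn side. This is
only a minimal sketch on synthetic data from make_classification, not the
data behind the attached plots, and all variable names are illustrative. As
far as I can tell, scikit-learn normalizes feature_importances_ so that they
sum to 1, whereas R's MeanDecreaseGini is reported unnormalized, which would
explain the difference in scale:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; not the data from the attachments).
X, y = make_classification(n_samples=300, n_features=12, n_informative=4,
                           random_state=0)

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

importances = rf.feature_importances_
print("sum of importances:", importances.sum())

# To read them on a percentage scale, multiply by 100.
print("as percentages:", np.round(100 * importances, 1))
```

To compare with R on a common scale, one could likewise divide R's
MeanDecreaseGini values by their sum.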


Cheers & Thanks,
Paul

>
> Hi Olivier,
>
> There are indeed several ways to get feature "importances". As
> often, there is no strict consensus about what this word means.
>
> In our case, we implement the importance as described in [1] (often
> cited, but unfortunately rarely read...). It is sometimes called
> "gini importance" or "mean decrease impurity" and is defined as the
> total decrease in node impurity (weighted by the probability of
> reaching that node, which is approximated by the proportion of
> samples reaching it), averaged over all trees of the ensemble.
>
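
The definition above can be sketched by walking the fitted trees by hand.
This is an illustration under stated assumptions, not the library's source:
it uses synthetic data, the helper name tree_mdi is hypothetical, and it
reads the public tree_ arrays (children_left, children_right, feature,
impurity, weighted_n_node_samples). On recent scikit-learn versions the
result should agree with feature_importances_:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def tree_mdi(estimator, n_features):
    """Mean decrease impurity for one fitted tree (hypothetical helper)."""
    t = estimator.tree_
    w = t.weighted_n_node_samples
    imp = np.zeros(n_features)
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:  # leaf node: no split, no impurity decrease
            continue
        # Impurity decrease of this split, weighted by the samples reaching it.
        decrease = (w[node] * t.impurity[node]
                    - w[left] * t.impurity[left]
                    - w[right] * t.impurity[right])
        imp[t.feature[node]] += decrease
    imp /= w[0]  # turn sample counts into proportions of the root
    total = imp.sum()
    return imp / total if total > 0 else imp


X, y = make_classification(n_samples=300, n_features=8, n_informative=3,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Average the per-tree importances over the ensemble, then renormalize.
manual = np.mean([tree_mdi(est, X.shape[1]) for est in rf.estimators_], axis=0)
manual /= manual.sum()

print(np.allclose(manual, rf.feature_importances_))
```
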
> The other measure is the one you describe. It is sometimes called
> "mean decrease accuracy". It is more intensive to compute since it
> requires (repeated) random permutations of each feature. It also
> works only with bootstrapping, since it relies on out-of-bag samples.
>
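
A rough sketch of the permutation idea follows. Note the swap: instead of the
out-of-bag samples used in Breiman's method, it scores on a held-out test
split, which is easier to write with scikit-learn's public API. The data is
synthetic and names such as n_repeats and mda are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# shuffle=False keeps the 3 informative features in the first 3 columns.
X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
baseline = rf.score(X_test, y_test)  # accuracy on untouched held-out data

rng = np.random.RandomState(0)
n_repeats = 10
mda = np.zeros(X.shape[1])
for j in range(X.shape[1]):
    drops = []
    for _ in range(n_repeats):
        X_perm = X_test.copy()
        # Permuting one column breaks its link with the target.
        X_perm[:, j] = rng.permutation(X_perm[:, j])
        drops.append(baseline - rf.score(X_perm, y_test))
    mda[j] = np.mean(drops)  # mean decrease in accuracy for feature j

print("baseline accuracy:", baseline)
print("mean decrease in accuracy per feature:", np.round(mda, 3))
```

Each feature costs n_repeats extra scoring passes, which is why this measure
is more expensive than the impurity-based one.
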
> Note that both measures are available in the randomForest R package.
>
> [1]: Breiman, Friedman, "Classification and regression trees", 1984.
>
> I'll reply on SO as well.
>
> Hope this helps,
>
> Gilles
>
>

> On 4 April 2013 21:35, Peter Prettenhofer <[email protected]> wrote:
> I posted a brief description of the algorithm. The method that we
> implement is briefly described in ESLII. Gilles is the expert here,
> he can give more details on the issue.
>

> 2013/4/4 Olivier Grisel <[email protected]>
> The variable importance in scikit-learn's implementation of random
> forests is based on the proportion of samples that were classified by
> the feature at some point during the evaluation of one of the decision
> trees.
>
> http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation
>
> This method seems different from the OOB-based method of Breiman 2001
> (section 10):
>
> http://www.stat.berkeley.edu/~breiman/randomforest2001.pdf
>
> Is there any reference for the method implemented in the scikit?
>
> Here is the original Stack Overflow question:
>
> http://stackoverflow.com/questions/15810339/how-are-feature-importances-in-randomforestclassifier-determined/15811003?noredirect=1#comment22487062_15811003
>
> --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel
>
>
------------------------------------------------------------------------------

> Minimize network downtime and maximize team effectiveness.
> Reduce network management and security costs.Learn how to hire
> the most talented Cisco Certified professionals. Visit the
> Employer Resources Portal
> http://www.cisco.com/web/learning/employer_resources/index.html
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>

>
> --
> Peter Prettenhofer
>
>



<<attachment: RF_sklearn.png>>

<<attachment: RF_R.png>>
