Dear Gilles,

Sorry to jump into this discussion, but it caught my interest.
In the R randomForest package, MeanDecreaseGini can be calculated.

Does scikit-learn somehow scale MeanDecreaseGini to a percentage scale?

Please find attached the variable importances as computed by scikit-learn's 
RF and R's RF.




In the R case I had only 10 features, while in the sklearn case there were 
a few more.
Of course, one cannot directly compare the absolute numbers of 
VariableImportance/MeanDecreaseGini, but I'm astonished to get such large 
values from the R implementation.
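
To make the scaling question concrete, here is a minimal Python sketch of 
what I mean. It assumes, and I have not checked this in either code base, 
that one implementation reports raw impurity decreases while the other 
divides them by their total. The feature names and raw values are invented 
for illustration, and normalize_importances is a hypothetical helper, not a 
function from either library.

```python
def normalize_importances(raw):
    """Rescale raw importances so that they sum to 1 (shares of the total)."""
    total = sum(raw.values())
    if total == 0:
        # Nothing was split on: report zero importance everywhere.
        return {name: 0.0 for name in raw}
    return {name: value / total for name, value in raw.items()}


# Hypothetical raw mean-decrease-Gini values, invented for illustration only.
raw = {"f1": 120.0, "f2": 60.0, "f3": 20.0}
scaled = normalize_importances(raw)

for name in raw:
    print(f"{name}: raw={raw[name]:.1f}  share={scaled[name]:.2f}")
```

If that is what happens, the ranking of the features would be the same in 
both plots and only the axis scale would differ.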


Cheers & Thanks,
Paul



> 
> Hi Olivier,
> 
> There are indeed several ways to get feature "importances". As 
> often, there is no strict consensus about what this word means.
> 
> In our case, we implement the importance as described in [1] (often 
> cited, but unfortunately rarely read...). It is sometimes called 
> "gini importance" or "mean decrease impurity" and is defined as the 
> total decrease in node impurity, weighted by the probability of 
> reaching that node (approximated by the proportion of samples 
> reaching it), averaged over all trees of the ensemble.
> 
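
As I understand the definition above, it could be sketched roughly as 
follows in plain Python. This is only my reading of the description, not 
the scikit-learn code: the Node class, the helper names and the two 
hand-built toy trees are all made up for illustration, and no final 
normalisation is applied.

```python
from collections import defaultdict


def gini(counts):
    """Gini impurity of a node, given its per-class sample counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


class Node:
    """Toy tree node: class counts, plus a split feature and children if internal."""

    def __init__(self, counts, feature=None, left=None, right=None):
        self.counts = counts
        self.feature = feature
        self.left = left
        self.right = right


def tree_importances(root):
    """Sum, per feature, the impurity decrease of each split weighted by
    the proportion of samples reaching the split node."""
    n_total = sum(root.counts)
    importances = defaultdict(float)
    stack = [root]
    while stack:
        node = stack.pop()
        if node.feature is None:  # leaf: no split, no contribution
            continue
        n = sum(node.counts)
        n_left = sum(node.left.counts)
        n_right = sum(node.right.counts)
        decrease = (gini(node.counts)
                    - (n_left / n) * gini(node.left.counts)
                    - (n_right / n) * gini(node.right.counts))
        importances[node.feature] += (n / n_total) * decrease
        stack.extend([node.left, node.right])
    return dict(importances)


def forest_importances(trees):
    """Average the per-tree importances over all trees of the ensemble."""
    totals = defaultdict(float)
    for root in trees:
        for feature, value in tree_importances(root).items():
            totals[feature] += value
    return {feature: value / len(trees) for feature, value in totals.items()}


# Two hand-built toy trees on 10 samples with 2 classes (illustrative only).
tree_a = Node([5, 5], "x0",
              left=Node([4, 1], "x1", left=Node([4, 0]), right=Node([0, 1])),
              right=Node([1, 4]))
tree_b = Node([5, 5], "x1", left=Node([5, 0]), right=Node([0, 5]))

importances = forest_importances([tree_a, tree_b])
print(importances)
```
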
> The other measure is the one you describe. It is sometimes called 
> "mean decrease accuracy". It is more expensive to compute since it 
> requires (repeated) random permutations of each feature. It also 
> works only with bootstrapping, since it relies on out-of-bag samples.
> 
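
For comparison, a rough plain-Python sketch of the permutation idea. Note 
the simplifications: it scores on one fixed data set instead of the 
out-of-bag samples of each tree, the "model" is a toy stand-in function, 
and mean_decrease_accuracy is a hypothetical helper name.

```python
import random


def accuracy(predict, X, y):
    """Fraction of rows for which the prediction equals the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def mean_decrease_accuracy(predict, X, y, n_repeats=10, seed=0):
    """For each feature, shuffle its column n_repeats times and return the
    average drop in accuracy relative to the unshuffled data."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j permuted.
            X_perm = [row[:j] + [value] + row[j + 1:]
                      for row, value in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


def predict(row):
    """Toy stand-in for a fitted model: it looks at feature 0 only."""
    return int(row[0] > 0.5)


data_rng = random.Random(42)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [predict(row) for row in X]  # labels depend on feature 0 only

importances = mean_decrease_accuracy(predict, X, y)
print(importances)
```

The second feature is never used by the toy model, so shuffling it cannot 
change any prediction and its importance comes out as zero.
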
> Note that both measures are available in the randomForest R package. 
> 
> [1]: Breiman, Friedman, "Classification and regression trees", 1984.
> 
> I'll reply on SO as well.
> 
> Hope this helps,
> 
> Gilles
> 
> 

> On 4 April 2013 21:35, Peter Prettenhofer <[email protected]> wrote:
> I posted a brief description of the algorithm. The method that we 
> implement is briefly described in ESLII. Gilles is the expert here, 
> he can give more details on the issue.
> 

> 2013/4/4 Olivier Grisel <[email protected]>
> The variable importance in scikit-learn's implementation of random
> forest is based on the proportion of samples that were classified by
> the feature at some point during the evaluation of one of the
> decision trees.
> 
> http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation
> 
> This method seems different from the OOB based method of Breiman 2001
> (section 10):
> 
> http://www.stat.berkeley.edu/~breiman/randomforest2001.pdf
> 
> Is there any reference for the method implemented in the scikit?
> 
> Here is the original Stack Overflow question:
> 
> http://stackoverflow.com/questions/15810339/how-are-feature-importances-in-randomforestclassifier-determined/15811003?noredirect=1#comment22487062_15811003
> 
> --
> Olivier
> http://twitter.com/ogrisel - http://github.com/ogrisel
> 
> 
------------------------------------------------------------------------------
> Minimize network downtime and maximize team effectiveness.
> Reduce network management and security costs.Learn how to hire
> the most talented Cisco Certified professionals. Visit the
> Employer Resources Portal
> http://www.cisco.com/web/learning/employer_resources/index.html
> _______________________________________________
> Scikit-learn-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
> 

> 
> -- 
> Peter Prettenhofer
> 
> 





Attachment: RF_R.png
Description: Binary data

Attachment: RF_sklearn.png
Description: Binary data

