Hi Paul,
Sorry to jump into this discussion, but it caught my interest.
> In the R RandomForest package, MeanDecreaseGini can be calculated.
>
>
> Does scikit-learn somehow scale MeanDecreaseGini to the percentage scale?
>
Yes. The randomForest R package, by contrast, applies essentially no
scaling or normalization.
In the randomForest package, MeanDecreaseGini is the total weighted Gini
decrease summed over all nodes splitting on that feature, averaged over all
trees. There, the Gini decreases are weighted by the number of samples in
the corresponding nodes, while in scikit-learn they are weighted by the
proportion of samples. We use that definition to get a measure that is
independent of the number of samples. (Both are equivalent up to a
constant factor.)
Also, in scikit-learn the feature importances vector is normalized to sum
to one, while the randomForest R package does no such post-processing.
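To make the two weightings concrete, here is a minimal sketch on a single
scikit-learn decision tree (not a full forest, and not the R code itself).
The helper name impurity_decrease and the synthetic data are made up for
illustration. The count-weighted variant only mimics the weighting described
above for the R package; it is an assumption about how that weighting looks,
not a port of randomForest.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, purely for illustration.
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           random_state=0)
clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
tree = clf.tree_
n_root = tree.weighted_n_node_samples[0]


def impurity_decrease(tree, n_features, weight_by_proportion):
    """Sum the weighted Gini decrease of every split, per feature."""
    n = tree.weighted_n_node_samples
    imp = np.zeros(n_features)
    for node in range(tree.node_count):
        left = tree.children_left[node]
        right = tree.children_right[node]
        if left == -1:  # leaf node: no split, no decrease
            continue
        # Decrease weighted by the *number* of samples in each node.
        decrease = (n[node] * tree.impurity[node]
                    - n[left] * tree.impurity[left]
                    - n[right] * tree.impurity[right])
        if weight_by_proportion:
            # Weight by the *proportion* of samples instead.
            decrease /= n[0]
        imp[tree.feature[node]] += decrease
    return imp


by_count = impurity_decrease(tree, X.shape[1], weight_by_proportion=False)
by_prop = impurity_decrease(tree, X.shape[1], weight_by_proportion=True)

# The two weightings differ only by the constant factor n_root.
print(np.allclose(by_count, by_prop * n_root))

# Rescaling so the vector sums to one gives the values scikit-learn reports.
normalized = by_prop / by_prop.sum()
print(np.allclose(normalized, clf.feature_importances_))
```

The count-weighted numbers grow with the size of the training set, which is
one reason unnormalized values can look large next to scikit-learn's.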
>
> Please find attached the variable importance as computed by scikit-learn's
> RF & R's RF.
>
>
>
>
> In the R case, I only had 10 features, but in the sklearn case, there were
> a few more.
> Of course, one cannot compare the absolute numbers of
> VariableImportance/MeanDecreaseGini, but I'm astonished to get such large
> values in the R implementation.
>
Please see my comments above. This is not surprising given the
normalization scheme we use.
Note that you should also use the same set of features in both cases if you
want comparable importances. Since the importance of a feature captures
multivariate effects, the presence of any other relevant feature can change
it. Therefore, using different feature sets might lead to significantly
different results.
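As a rough illustration of that last point, the sketch below fits the same
forest once on all features and once on a subset, then prints the importances
of the shared features side by side. The dataset, the 12-versus-10 split and
the helper fit_importances are invented for this example; they are not Paul's
data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with some redundant features, purely for illustration.
X, y = make_classification(n_samples=500, n_features=12, n_informative=4,
                           n_redundant=4, random_state=0)


def fit_importances(X_sub):
    """Fit a forest on the given columns and return its importances."""
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(X_sub, y)
    return rf.feature_importances_


imp_all = fit_importances(X)              # all 12 features
imp_first10 = fit_importances(X[:, :10])  # only the first 10 features

# Each vector sums to one over its own feature set, so the same feature
# can receive a different share depending on which other features exist.
for j in range(10):
    print(f"feature {j}: all 12 = {imp_all[j]:.3f}, "
          f"first 10 only = {imp_first10[j]:.3f}")
```

Because each vector is normalized over its own feature set, the values are
only comparable when both models see the same columns.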
Hope this answers some of your questions,
best,
Gilles
> Cheers & Thanks,
> Paul
>
>
>
> >
> > Hi Olivier,
> >
> > There are indeed several ways to get feature "importances". As
> > often, there is no strict consensus about what this word means.
> >
> > In our case, we implement the importance as described in [1] (often
> > cited, but unfortunately rarely read...). It is sometimes called
> > "gini importance" or "mean decrease impurity" and is defined as the
> > total decrease in node impurity, weighted by the probability of
> > reaching that node (approximated by the proportion of samples
> > reaching it), averaged over all trees of the ensemble.
> >
> > The other measure is the one you describe. It is sometimes called
> > "mean decrease accuracy". It is more expensive to compute since it
> > requires (repeated) random permutations of each feature. It also
> > works only with bootstrapping, since it is evaluated on the
> > out-of-bag samples.
> >
> > Note that both measures are available in the randomForest R package.
> >
> > [1]: Breiman, Friedman, Olshen, Stone, "Classification and Regression
> > Trees", 1984.
> >
> > I'll reply on SO as well.
> >
> > Hope this helps,
> >
> > Gilles
> >
> >
>
> > On 4 April 2013 21:35, Peter Prettenhofer <[email protected]> wrote:
> > I posted a brief description of the algorithm. The method that we
> > implement is briefly described in ESLII. Gilles is the expert here,
> > he can give more details on the issue.
> >
>
> > 2013/4/4 Olivier Grisel <[email protected]>
> > The variable importance in scikit-learn's implementation of random
> > forest is based on the proportion of samples that were classified by
> > the feature at some point during the evaluation of one of the
> > decision trees.
> >
> > http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation
> >
> > This method seems different from the OOB based method of Breiman 2001
> > (section 10):
> >
> > http://www.stat.berkeley.edu/~breiman/randomforest2001.pdf
> >
> > Is there any reference for the method implemented in the scikit?
> >
> > Here is the original Stack Overflow question:
> >
> > http://stackoverflow.com/questions/15810339/how-are-feature-importances-in-randomforestclassifier-determined/15811003?noredirect=1#comment22487062_15811003
> >
> > --
> > Olivier
> > http://twitter.com/ogrisel - http://github.com/ogrisel
> >
> >
>
> ------------------------------------------------------------------------------
> > Minimize network downtime and maximize team effectiveness.
> > Reduce network management and security costs.Learn how to hire
> > the most talented Cisco Certified professionals. Visit the
> > Employer Resources Portal
> > http://www.cisco.com/web/learning/employer_resources/index.html
> > _______________________________________________
> > Scikit-learn-general mailing list
> > [email protected]
> > https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
> >
>
> >
> > --
> > Peter Prettenhofer
> >
> >
>