Sebastian,

That does indeed help. I now understand that the calculated importance is indeed the average Gini importance. Thank you very much!
Efrem Braun

Hi, Efrem,

I agree, this can maybe cause confusion. However, to me,

1)
> expected fraction of samples they contribute to, (though it is not
> explicitly stated how this expectation value is calculated

2)
> Per my understanding of the source code, the importance within each tree is
> just the Gini importance, which is the reduction in Gini impurity that the
> variable brings about. These are again then averaged over all trees in the
> forest.

sound like the same thing, where "expected fraction" refers to the information gain, e.g., the Gini gain if Gini is the impurity measure.

> though it is not explicitly stated how this expectation value is calculated
> other than to say that it depends on how high up the feature contributes to a
> split in the tree

This doesn't require any special weighting or so: the higher up you are in the tree, the larger your information gain at a given node will be, since this is basically the tree-growing criterion, where (simplified)

    Information gain = Impurity(parent) - Impurity(child_left) - Impurity(child_right)

So basically all you do is average your impurity decrease over the trees for the respective features.

Hope that helps!

Best,
Sebastian

On Wed, Aug 5, 2015 at 5:33 PM, Efrem Braun <efrem.br...@berkeley.edu> wrote:

> Hello,
>
> I would like to question what appears to me to be a discrepancy between
> the source code and the documentation in regards to how the feature
> importance is calculated for the random forest regressor.
>
> Per the documentation
> (http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation),
> each variable's importance within each tree is calculated based on the
> expected fraction of samples it contributes to (though it is not
> explicitly stated how this expectation value is calculated, other than to
> say that it depends on how high up the feature contributes to a split in
> the tree), and these importances are then averaged over all trees in the
> forest.
>
> Per my understanding of the source code, the importance within each tree
> is just the Gini importance, which is the reduction in Gini impurity that
> the variable brings about. These are again then averaged over all trees
> in the forest.
>
> To me, these two definitions are not the same thing. Is the documentation
> just attempting to provide a simplified definition of the Gini importance,
> or is the importance that the code provides NOT the Gini importance?
>
> Thanks for your help. I'm new to machine learning, so I do understand that
> the problem is most likely due to my own ignorance.
>
> Efrem Braun
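P.S. To make the averaging concrete, here is a minimal sketch that recomputes the importances by hand from the fitted trees' public `tree_` attributes (the dataset and forest settings are only illustrative). For each internal node it credits the weighted impurity decrease to the splitting feature, normalizes per tree, and averages over the forest; the result should match `feature_importances_`:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=4, random_state=0)
forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

def tree_importances(estimator):
    """Normalized weighted-impurity-decrease importances of one tree."""
    t = estimator.tree_
    n = t.weighted_n_node_samples
    importances = np.zeros(t.n_features)
    for node in range(t.node_count):
        left, right = t.children_left[node], t.children_right[node]
        if left == -1:  # leaf node: no split, no impurity decrease
            continue
        # weighted impurity decrease achieved by the split at this node
        decrease = (n[node] * t.impurity[node]
                    - n[left] * t.impurity[left]
                    - n[right] * t.impurity[right])
        importances[t.feature[node]] += decrease
    return importances / importances.sum()  # normalize to sum to 1

# average the per-tree (normalized) importances over the forest
manual = np.mean([tree_importances(est) for est in forest.estimators_], axis=0)
print(np.allclose(manual, forest.feature_importances_))
```

Nodes higher up in the tree see more samples (larger `weighted_n_node_samples`), which is exactly how "how high up the feature contributes to a split" enters the importance without any extra weighting.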
------------------------------------------------------------------------------
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general