Hello, I would like to ask about what appears to me to be a discrepancy between the source code and the documentation regarding how feature importance is calculated for the random forest regressor.
Per the documentation (http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation), each variable's importance within each tree is calculated from the expected fraction of samples it contributes to. It is not explicitly stated how this expectation is calculated, other than that it depends on how high up in the tree the feature is used in a split. These importances are then averaged over all trees in the forest.

Per my understanding of the source code, the importance within each tree is just the Gini importance, which is the reduction in Gini impurity that the variable brings about. These are likewise averaged over all trees in the forest.

To me, these two definitions are not the same thing. Is the documentation just attempting to provide a simplified definition of the Gini importance, or is the importance that the code provides NOT the Gini importance?

Thanks for your help. I'm new to machine learning, so I do understand that the problem is most likely due to my own ignorance.

Efrem Braun
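P.S. To make the second definition concrete, here is a minimal sketch of how I understand an impurity-based importance for a single tree. It is not scikit-learn's code. The helper names (`variance`, `accumulate`, `tree_importances`), the nested-dict tree layout, and the toy data are all made up for illustration. I am also assuming that for a regressor the impurity is the variance of the targets in a node rather than the Gini impurity, and that each split's impurity decrease is weighted by the fraction of samples reaching that node, which may be where the "fraction of samples" wording in the documentation comes from.

```python
def variance(ys):
    """Impurity of a regression node: variance of its target values."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)


def accumulate(node, rows, n_total, totals):
    """Add each split's weighted impurity decrease to its feature's total."""
    if node is None or not rows:
        return  # leaf (or no samples): nothing to add
    f, thr = node["feature"], node["threshold"]
    left = [(x, y) for x, y in rows if x[f] <= thr]
    right = [(x, y) for x, y in rows if x[f] > thr]
    if not left or not right:
        return  # degenerate split: no impurity decrease
    n = len(rows)
    decrease = (
        variance([y for _, y in rows])
        - len(left) / n * variance([y for _, y in left])
        - len(right) / n * variance([y for _, y in right])
    )
    # Weight by the fraction of all samples that reach this node.
    totals[f] += n / n_total * decrease
    accumulate(node["left"], left, n_total, totals)
    accumulate(node["right"], right, n_total, totals)


def tree_importances(tree, X, y, n_features):
    """Normalized impurity-based importances for one hand-built tree."""
    totals = [0.0] * n_features
    accumulate(tree, list(zip(X, y)), len(X), totals)
    s = sum(totals)
    return [t / s for t in totals] if s > 0 else totals


# Hypothetical toy data: feature 0 drives most of the target, feature 1 a little.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0.0, 1.0, 10.0, 11.0]

# Hypothetical tree: split on feature 0 at the root, then on feature 1 in each child.
child = {"feature": 1, "threshold": 0.5, "left": None, "right": None}
tree = {"feature": 0, "threshold": 0.5, "left": dict(child), "right": dict(child)}

importances = tree_importances(tree, X, y, n_features=2)
print(importances)
```

In this sketch the feature used at the root ends up with almost all of the importance, both because its split removes most of the variance and because every sample passes through it. A forest-level importance would then be the average of these per-tree vectors, if my reading is right.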
------------------------------------------------------------------------------
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general