Hello,

I would like to ask about what appears to me to be a discrepancy between the
source code and the documentation regarding how feature importance is
calculated for the random forest regressor.

Per the documentation (
http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation),
each variable's importance within a tree is calculated from the expected
fraction of samples it contributes to (though it is not explicitly stated
how this expected value is calculated, other than that it depends on how
high up in the tree the feature is used in a split), and these importances
are then averaged over all trees in the forest.

Per my understanding of the source code, the importance within each tree is
just the Gini importance, i.e. the reduction in Gini impurity that the
variable brings about. These importances are then likewise averaged over
all trees in the forest.
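
To make my reading of the source concrete, here is a minimal sketch in plain
Python. It is not scikit-learn's actual code: the toy trees and the names
`gini`, `tree_importances`, and `forest_importances` are made up for
illustration, and I am assuming that each split's impurity reduction is
weighted by the number of samples reaching that node and that each tree's
importances are normalized to sum to 1 before averaging.

```python
# Hypothetical illustration only; none of these names come from scikit-learn.

def gini(counts):
    """Gini impurity of a node, given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def tree_importances(splits):
    """Impurity-based importances for one toy tree.

    `splits` lists the internal nodes, root first, as tuples of
    (feature, parent class counts, left class counts, right class counts).
    """
    n_root = sum(splits[0][1])
    raw = {}
    for feature, parent, left, right in splits:
        # Assumed weighting: impurity decrease scaled by the node's sample count.
        decrease = (
            sum(parent) * gini(parent)
            - sum(left) * gini(left)
            - sum(right) * gini(right)
        ) / n_root
        raw[feature] = raw.get(feature, 0.0) + decrease
    total = sum(raw.values())
    # Assumed normalization: importances within one tree sum to 1.
    return {feature: value / total for feature, value in raw.items()}


def forest_importances(trees):
    """Average the per-tree importances over all trees in the toy forest."""
    per_tree = [tree_importances(splits) for splits in trees]
    features = sorted({f for imp in per_tree for f in imp})
    return {
        f: sum(imp.get(f, 0.0) for imp in per_tree) / len(per_tree)
        for f in features
    }


# Toy tree 1: x0 splits the root, x1 splits both children into pure leaves.
tree_a = [
    ("x0", [50, 50], [40, 10], [10, 40]),
    ("x1", [40, 10], [40, 0], [0, 10]),
    ("x1", [10, 40], [10, 0], [0, 40]),
]
# Toy tree 2: a single perfect split on x0.
tree_b = [("x0", [50, 50], [50, 0], [0, 50])]

importances_a = tree_importances(tree_a)
importances_forest = forest_importances([tree_a, tree_b])
```

If that weighting assumption is right, a split near the root counts for more
simply because more samples pass through it, which may be what the
documentation means by the fraction of samples a feature contributes to.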

To me, these two definitions are not the same thing. Is the documentation
just attempting to provide a simplified definition of the Gini importance,
or is the importance that the code provides NOT the Gini importance?

Thanks for your help. I'm new to machine learning, so I do understand that
the problem is most likely due to my own ignorance.


Efrem Braun
