Sebastian,

That does indeed help. I now understand that the calculated importance is
the average Gini importance. Thank you very much!

Efrem Braun

Hi, Efrem,

I agree, this could cause confusion.
However, to me,

1)

> expected fraction of samples they contribute to, (though it is not explicitly 
> stated how this expectation value is calculated


2)

> Per my understanding of the source code, the importance within each tree is 
> just the Gini importance, which is the reduction in Gini impurity that the 
> variable brings about. These are again then averaged over all trees in the 
> forest.

sound like the same thing, where "expected fraction" refers to the
information gain, e.g., the Gini gain if Gini is the impurity measure.
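As a quick sketch of that equivalence (a toy two-class split I made up for illustration, not data from this thread), the Gini gain of a split is just the parent impurity minus the weighted child impurities:

```python
import numpy as np

def gini(y):
    # Gini impurity: 1 - sum of squared class proportions
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# Hypothetical split of 10 samples into two children
parent = np.array([0] * 5 + [1] * 5)
left = np.array([0] * 4 + [1] * 1)
right = np.array([0] * 1 + [1] * 4)

# Weighted impurity decrease (Gini gain) of this split
n = len(parent)
gain = (gini(parent)
        - (len(left) / n) * gini(left)
        - (len(right) / n) * gini(right))
```

Summing these gains over every node where a feature is used to split (weighted by the fraction of samples reaching the node) gives that feature's importance within a single tree.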

> though it is not explicitly stated how this expectation value is calculated 
> other than to say that it depends on how high up the feature contributes to a 
> split in the tree


This doesn't involve any special weighting: the higher up you are in the
tree, the larger your information gain at a given node will be, since
this is basically the tree-growing criterion, where (simplified)
"information gain = impurity(parent) - impurity(child_left) -
impurity(child_right)". So basically all you do is average the
impurity decrease over the trees for the respective features.
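One way to check this averaging yourself (a quick sketch using scikit-learn's public attributes; the toy dataset and parameters are only illustrative) is to compare the forest's reported importances against the mean of the per-tree importances:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Toy regression problem, purely for illustration
X, y = make_regression(n_samples=200, n_features=5, random_state=0)
forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)

# Average the per-tree impurity-based importances by hand ...
manual = np.mean(
    [tree.feature_importances_ for tree in forest.estimators_], axis=0
)

# ... which should match the forest's feature_importances_
print(np.allclose(forest.feature_importances_, manual))
```

(For the regressor the per-node criterion is MSE rather than Gini, but the aggregation over trees works the same way.)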

Hope that helps!

Best,
Sebastian


On Wed, Aug 5, 2015 at 5:33 PM, Efrem Braun <efrem.br...@berkeley.edu>
wrote:

> Hello,
>
> I would like to question what appears to me to be a discrepancy between
> the source code and the documentation in regards to how the feature
> importance is calculated for the random forest regressor.
>
> Per the documentation (
> http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation),
> each of the variable's importances within each tree are calculated based on
> the expected fraction of samples they contribute to, (though it is not
> explicitly stated how this expectation value is calculated other than to
> say that it depends on how high up the feature contributes to a split in
> the tree), and these importances are then averaged over all trees in the
> forest.
>
> Per my understanding of the source code, the importance within each tree
> is just the Gini importance, which is the reduction in Gini impurity that
> the variable brings about. These are again then averaged over all trees in
> the forest.
>
> To me, these two definitions are not the same thing. Is the documentation
> just attempting to provide a simplified definition of the Gini importance,
> or is the importance that the code provides NOT the Gini importance?
>
> Thanks for your help. I'm new to machine learning, so I do understand that
> the problem is most likely due to my own ignorance.
>
> Efrem Braun
>
------------------------------------------------------------------------------
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
