Hi, Efrem,

I agree, this could cause confusion.
However, to me,

1)

> expected fraction of samples they contribute to, (though it is not explicitly 
> stated how this expectation value is calculated


2)

> Per my understanding of the source code, the importance within each tree is 
> just the Gini importance, which is the reduction in Gini impurity that the 
> variable brings about. These are again then averaged over all trees in the 
> forest.

sound like the same thing, where the "expected fraction" refers to the 
information gain, e.g., the Gini gain if Gini is the impurity measure.
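To make the terminology concrete, here is a rough Python sketch of the two quantities (the function names are my own for illustration, not scikit-learn's API):

```python
def gini(labels):
    """Gini impurity: 1 - sum over classes of p_k^2,
    where p_k is the proportion of samples with class k."""
    n = len(labels)
    if n == 0:
        return 0.0
    proportions = [labels.count(c) / n for c in set(labels)]
    return 1.0 - sum(p * p for p in proportions)

def gini_gain(parent, left, right):
    """Impurity decrease from splitting `parent` into `left` and `right`,
    with each child's impurity weighted by its fraction of samples."""
    n = len(parent)
    return (gini(parent)
            - (len(left) / n) * gini(left)
            - (len(right) / n) * gini(right))
```

A perfect split of a 50/50 node, e.g. `gini_gain([0, 0, 1, 1], [0, 0], [1, 1])`, yields the full parent impurity of 0.5 as gain.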

> though it is not explicitly stated how this expectation value is calculated 
> other than to say that it depends on how high up the feature contributes to a 
> split in the tree


This doesn't involve any special weighting: the higher up you are in the tree, 
the larger your information gain at a given node will be, since maximizing that 
gain is basically the tree-growing criterion. Here (simplified) "Information 
gain = Impurity(parent) - Impurity(child_left) - Impurity(child_right)"; in 
practice, each child's impurity is weighted by the fraction of samples that 
reach it. So basically all you do is average the impurity decrease over the 
trees for the respective features.
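As a rough sketch of that averaging step (this is my own toy node representation, not scikit-learn's internal tree structure):

```python
from collections import defaultdict

def tree_importances(nodes, n_samples):
    """Per-feature importance in one tree: sum of sample-weighted impurity
    decreases over all nodes that split on that feature.
    `nodes` is a hypothetical list of dicts with keys: feature, n_node,
    impurity, n_left, imp_left, n_right, imp_right."""
    importances = defaultdict(float)
    for nd in nodes:
        decrease = (nd["n_node"] * nd["impurity"]
                    - nd["n_left"] * nd["imp_left"]
                    - nd["n_right"] * nd["imp_right"]) / n_samples
        importances[nd["feature"]] += decrease
    return dict(importances)

def forest_importances(trees, n_features, n_samples):
    """Forest importance = average of the per-tree importances."""
    totals = [0.0] * n_features
    for nodes in trees:
        for feature, value in tree_importances(nodes, n_samples).items():
            totals[feature] += value
    return [t / len(trees) for t in totals]
```

For example, a root node with 4 samples at impurity 0.5 split perfectly on feature 0 contributes a decrease of (4 * 0.5 - 0 - 0) / 4 = 0.5, and averaging identical trees leaves that value unchanged.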

Hope that helps!

Best,
Sebastian

> On Aug 5, 2015, at 11:36 AM, Efrem Braun <efrem.br...@berkeley.edu> wrote:
> 
> Hello,
> 
> I would like to question what appears to me to be a discrepancy between the 
> source code and the documentation in regards to how the feature importance is 
> calculated for the random forest regressor.
> 
> Per the documentation 
> (http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation
>  
> <http://scikit-learn.org/stable/modules/ensemble.html#feature-importance-evaluation>),
>  each of the variable's importances within each tree are calculated based on 
> the expected fraction of samples they contribute to, (though it is not 
> explicitly stated how this expectation value is calculated other than to say 
> that it depends on how high up the feature contributes to a split in the 
> tree), and these importances are then averaged over all trees in the forest.
> 
> Per my understanding of the source code, the importance within each tree is 
> just the Gini importance, which is the reduction in Gini impurity that the 
> variable brings about. These are again then averaged over all trees in the 
> forest.
> 
> To me, these two definitions are not the same thing. Is the documentation 
> just attempting to provide a simplified definition of the Gini importance, or 
> is the importance that the code provides NOT the Gini importance?
> 
> Thanks for your help. I'm new to machine learning, so I do understand that 
> the problem is most likely due to my own ignorance.
> 
> 
> Efrem Braun
> ------------------------------------------------------------------------------
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
