Re: [scikit-learn] Gradient Boosting: Feature Importances do not sum to 1

Douglas Chan Mon, 12 Sep 2016 18:52:06 -0700

Thanks for sharing your comments about this, Piotr.

I agree with you that the feature importances of each tree in the ensemble 
should sum to 1.  For ExtraTreesRegressor, at least, the sum stays close to 1.  
For GB, however, the sum keeps decreasing as more estimators are added.


I think there’s a bug here, so I just filed an issue to track it:
https://github.com/scikit-learn/scikit-learn/issues/7406

-Doug


From: Piotr Bialecki 
Sent: Friday, September 09, 2016 5:11 AM
To: Scikit-learn user and developer mailing list 
Subject: Re: [scikit-learn] Gradient Boosting: Feature Importances do not sum 
to 1

Hi Doug,

I modified your code a little bit to calculate the feature_importances of every 
tree of the forest.
In my opinion these feature importances should also sum to 1.0.

Since I could not access each DecisionTreeRegressor of your 
GradientBoostingRegressor, I created a new 
ExtraTreesRegressor.

This is a bit off topic, but does anyone know why 
type(ExtraTreesRegressor().estimators_) 
results in a list and 
type(GradientBoostingRegressor().estimators_)
results in an np.array?
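
On the off-topic question: as far as I know, GradientBoostingRegressor keeps 
its fitted trees in a 2-D array of shape (n_estimators, 1), one row per 
boosting stage, while ExtraTreesRegressor keeps a plain list. Under that 
assumption the individual trees can still be reached by taking the first 
column. A minimal sketch on synthetic data (make_regression, the sizes, and 
the variable names are stand-ins, not part of the code in this thread):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor, GradientBoostingRegressor

# Synthetic stand-in data; the thread itself uses the Boston dataset.
X, y = make_regression(n_samples=200, n_features=5, random_state=0)

gb = GradientBoostingRegressor(n_estimators=20, max_depth=3, random_state=0)
gb.fit(X, y)
forest = ExtraTreesRegressor(n_estimators=20, random_state=0)
forest.fit(X, y)

# Assumed layout: one row per boosting stage, one column for single-target
# regression, so column 0 holds every DecisionTreeRegressor.
gb_trees = gb.estimators_[:, 0]

# Sum of the feature importances of each individual tree.
gb_tree_sums = [float(np.sum(t.feature_importances_)) for t in gb_trees]
forest_tree_sums = [float(np.sum(t.feature_importances_))
                    for t in forest.estimators_]

print(type(gb.estimators_), gb.estimators_.shape)
print(type(forest.estimators_), len(forest.estimators_))
print(min(gb_tree_sums), min(forest_tree_sums))
```

With the per-tree sums available for the boosted model as well, the same 
per-tree check used for the forest can be applied to it directly.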

Anyway, here is the code:

import numpy as np
from sklearn import datasets
from sklearn.ensemble import GradientBoostingRegressor, ExtraTreesRegressor

boston = datasets.load_boston()
X, Y = (boston.data, boston.target)

n_estimators = 712
# Note: From 712 onwards, the feature importance sum is less than 1
params = {'n_estimators': n_estimators, 'max_depth': 6, 'learning_rate': 0.1}
clf = GradientBoostingRegressor(**params)
clf.fit(X, Y)

# Sum of the ensemble-level feature importances of the boosted model
feature_importance_sum = np.sum(clf.feature_importances_)
print("At n_estimators = %i, feature importance sum = %.20f"
      % (n_estimators, feature_importance_sum))


# Same check for a forest, whose trees can be inspected one by one
n_estimators_forest = 100
clf_forest = ExtraTreesRegressor(n_estimators=n_estimators_forest)
clf_forest.fit(X, Y)

feature_importance_sum_forest = np.sum(clf_forest.feature_importances_)
# Per-tree sums of the feature importances
forest_feat_imp = [np.sum(tree.feature_importances_)
                   for tree in clf_forest.estimators_]
print("At n_estimators = %i, feature importance sum = %.20f"
      % (n_estimators_forest, feature_importance_sum_forest))
for idx, imp in enumerate(forest_feat_imp):
    print("imp for tree %i: %.20f" % (idx, imp))


I suppose each tree carries a small rounding error, and these add up to the 
overall error.
So is this a bug or an inevitable rounding issue?
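
One rough way to weigh the two possibilities is to compare their sizes with 
plain NumPy. The sketch below uses made-up per-tree importances (random rows 
normalized to sum to 1; the numbers 1000, 13 and 50 are arbitrary) and takes 
the ensemble importance to be the mean over trees, which is an assumption 
about how the ensemble combines them. It also assumes, purely as a 
hypothesis, that some trees might contribute an all-zero importance vector, 
for example if they never split; nothing in this thread confirms that this 
is what GradientBoostingRegressor does.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trees, n_features = 1000, 13  # arbitrary illustrative sizes

# Hypothetical per-tree importances: each row is normalized to sum to 1.
raw = rng.random((n_trees, n_features))
per_tree = raw / raw.sum(axis=1, keepdims=True)

# Case 1: every tree contributes. The averaged importances can only
# deviate from 1 through floating-point rounding.
rounding_gap = abs(1.0 - per_tree.mean(axis=0).sum())

# Case 2 (assumption): the last 50 trees have all-zero importances
# but are still counted in the average.
per_tree_with_empty = per_tree.copy()
per_tree_with_empty[-50:] = 0.0
empty_gap = 1.0 - per_tree_with_empty.mean(axis=0).sum()

print("gap from rounding only:      %.3e" % rounding_gap)
print("gap with 50 all-zero trees:  %.3e" % empty_gap)
```

If rounding alone were responsible, the gap would be expected to stay near 
machine precision, whereas all-zero rows shrink the sum in proportion to 
their share of the ensemble. Comparing the observed deficit against these 
two scales may help tell the cases apart.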


Greets,
Piotr


On 09.09.2016 03:51, Douglas Chan wrote:

  Hello everyone,

  I’d like to bring this up again to see if people have any thoughts on it.

  If you also think this is a bug, then we can track it and get it fixed.  
Please share your opinions.

  Thank you,
  -Doug


  From: Douglas Chan 
  Sent: Wednesday, August 31, 2016 4:52 PM
  To: Scikit-learn user and developer mailing list ; Raphael C 
  Subject: Re: [scikit-learn] Gradient Boosting: Feature Importances do not sum 
to 1

  Thanks for your reply, Raphael.

  Here’s some code using the Boston dataset to reproduce this.  

  === START CODE ===
  import numpy as np
  from sklearn import datasets
  from sklearn.ensemble import GradientBoostingRegressor

  boston = datasets.load_boston()
  X, Y = (boston.data, boston.target)

  n_estimators = 712   
  # Note: From 712 onwards, the feature importance sum is less than 1

  params = {'n_estimators': n_estimators, 'max_depth': 6, 'learning_rate': 0.1}
  clf = GradientBoostingRegressor(**params)
  clf.fit(X, Y)

  feature_importance_sum = np.sum(clf.feature_importances_)
  print("At n_estimators = %i, feature importance sum = %f"
        % (n_estimators, feature_importance_sum))

  === END CODE ===

  If we deem this to be an error, I can open a bug to track it.  Please share 
your thoughts on it.

  Thank you,
  -Doug


  From: Raphael C 
  Sent: Tuesday, August 30, 2016 11:28 PM
  To: Scikit-learn user and developer mailing list 
  Subject: Re: [scikit-learn] Gradient Boosting: Feature Importances do not sum 
to 1

  Can you provide a reproducible example? 
  Raphael

  On Wednesday, August 31, 2016, Douglas Chan <[email protected]> wrote:

    Hello everyone,

    I have noticed conditions under which the feature importance values do not 
add up to 1 in ensemble tree methods, such as Gradient Boosting Trees or 
AdaBoost Trees.  I wonder if there’s a bug in the code.

    This error occurs when the ensemble has a large number of estimators.  The 
exact conditions vary.  For example, the error shows up sooner with fewer 
training samples, or when the trees are deep.  

    When this error appears, the predicted value seems to have converged.  But 
it’s unclear whether the error is what keeps the predicted value from changing 
with more estimators.  From that point on, the feature importance sum goes 
lower and lower as more estimators are added.

    I wonder if we’re hitting some floating-point calculation error. 

    Looking forward to hearing your thoughts on this.

    Thank you!
    -Doug


------------------------------------------------------------------------------
  _______________________________________________
  scikit-learn mailing list
  [email protected]
  https://mail.python.org/mailman/listinfo/scikit-learn


   
