Hi all. I'm looking at the code behind one of the tree ensemble demos: http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html and I'm unsure about the error bars.
They are calculated as the standard deviation of the feature_importances_ attribute across the trees in the ensemble. Can we depend on that distribution being Normal? I'm wondering whether the plot tells enough of the story to be genuinely useful.

I don't have a strong prior on the likely distribution of feature_importances_, and I haven't dug into how the importances are calculated (frankly I'm a bit lost here). I do know that on a Random Forest regression case I'm working on I can see both unimodal and bimodal feature importance distributions - this came up in a discussion on the yellowbrick sklearn visualisation package: https://github.com/DistrictDataLabs/yellowbrick/pull/195

I don't know what is "normal" for feature importances, or whether they look different between classification tasks (as in the plot_forest_importances demo) and regression tasks. Maybe I've got an outlier in my task? If I use the provided demo code then my error bars can go negative, which feels unhelpful given that importances are non-negative.

Does anyone have an opinion? Perhaps more importantly - is a visual indication of the spread of feature importances in an ensemble actually a useful thing to plot? Does it serve a diagnostic purpose? I saw Sebastian Raschka's reference to Gilles Louppe et al.'s NIPS paper (in here, 2016-05-17) on variable importances; I'll dig into that if nobody has a strong opinion.

BTW Sebastian - thanks for writing your book.

Cheers, Ian.

--
Ian Ozsvald (Data Scientist, PyDataLondon co-chair)
i...@ianozsvald.com
http://IanOzsvald.com
http://ModelInsight.io
http://twitter.com/IanOzsvald

_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
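For anyone following along, here is a minimal sketch (not the demo's exact code; the dataset and all variable names are illustrative) of where those error bars come from - the per-tree importances are collected from forest.estimators_ and the bar is their std per feature - and of why a bar can cross zero even though importances themselves are non-negative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative data: a few informative features plus noise features.
X, y = make_classification(n_samples=500, n_features=6,
                           n_informative=3, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Per-tree importances, shape (n_trees, n_features).
per_tree = np.array([t.feature_importances_ for t in forest.estimators_])

mean_imp = per_tree.mean(axis=0)  # matches forest.feature_importances_
std_imp = per_tree.std(axis=0)    # what the demo uses as yerr

# A symmetric mean +/- std bar can dip below zero for weak features,
# even though every per-tree importance is >= 0.
lower = mean_imp - std_imp

# A boxplot of per_tree (one box per feature) would expose skew or
# bimodality that a symmetric std bar hides, e.g.:
#   plt.boxplot(per_tree, labels=[f"f{i}" for i in range(X.shape[1])])
```

Whether the std is a fair summary presumably depends on those per-tree distributions being roughly symmetric and unimodal, which is exactly what's in question here.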