Hi all. I'm looking at the code behind one of the tree ensemble demos: http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html and I'm unsure about the error bars.
They are calculated as the standard deviation of the feature_importances_ attribute across the trees in the ensemble. Can we depend on that distribution being Normal? I'm wondering whether the plot tells enough of the story to be genuinely useful.

I don't have a strong prior on the likely distribution of feature_importances_, and I haven't dug into how the importances are calculated (frankly I'm a bit lost here). I do know that on a Random Forest regression case I'm working on I can see both unimodal and bimodal feature importance distributions - this came up in a discussion on the yellowbrick sklearn visualisation package: https://github.com/DistrictDataLabs/yellowbrick/pull/195

I don't know what is "normal" for feature importances, or whether they look different between classification tasks (as in the plot_forest_importances demo) and regression tasks. Maybe I've got an outlier in my task? If I use the provided demo code then my error bars can go negative, which feels unhelpful given that importances are non-negative.

Does anyone have an opinion? Perhaps more importantly - is a visual indication of the spread of feature importances in an ensemble actually a useful thing to plot? Does it serve a diagnostic purpose? I saw Sebastian Raschka's reference to Gilles Louppe et al.'s NIPS paper (in here, 2016-05-17) on variable importances; I'll dig into that if nobody has a strong opinion.

BTW Sebastian - thanks for writing your book.

Cheers, Ian.

--
Ian Ozsvald (Data Scientist, PyDataLondon co-chair)
i...@ianozsvald.com
http://IanOzsvald.com
http://ModelInsight.io
http://twitter.com/IanOzsvald

_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
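For anyone following along, here is a minimal sketch (not the demo's exact code; the dataset and all variable names are illustrative) of where those error bars come from - the per-tree importances are collected from forest.estimators_ and the bar is their std per feature - and of why a bar can cross zero even though importances themselves are non-negative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Illustrative data: a few informative features plus noise features.
X, y = make_classification(n_samples=500, n_features=6,
                           n_informative=3, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Per-tree importances, shape (n_trees, n_features).
per_tree = np.array([t.feature_importances_ for t in forest.estimators_])

mean_imp = per_tree.mean(axis=0)  # matches forest.feature_importances_
std_imp = per_tree.std(axis=0)    # what the demo uses as yerr

# A symmetric mean +/- std bar can dip below zero for weak features,
# even though every per-tree importance is >= 0.
lower = mean_imp - std_imp

# A boxplot of per_tree (one box per feature) would expose skew or
# bimodality that a symmetric std bar hides, e.g.:
#   plt.boxplot(per_tree, labels=[f"f{i}" for i in range(X.shape[1])])
```

Whether the std is a fair summary presumably depends on those per-tree distributions being roughly symmetric and unimodal, which is exactly what's in question here.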