[ https://issues.apache.org/jira/browse/SPARK-11730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15014087#comment-15014087 ]
Seth Hendrickson commented on SPARK-11730:
------------------------------------------
I can work on this.

I was looking at the feature importance code for random forests, and it seems
that feature importance for single decision trees was implemented but not added
to the decision tree APIs. Since GBT feature importance will likely be some
aggregation of the individual tree importances, I think we'll need to add it
for decision trees first. I can create a JIRA to add {{featureImportances}} to
decision trees.

Regarding how it should be computed, I can verify that scikit-learn computes it
as the average of the feature importances across all of the trees in the
ensemble. Judging from the R vignette, I think that is how they do it as well.
The current implementation in spark.ml for random forests also averages the
importances across all trees, but it notes specifically not to do this for GBT.

[~josephkb] could you clarify this note, and say whether you have something in
mind that works for GBT? I haven't found a standard way of computing it for GBT
other than what is in scikit-learn.
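
To make the averaging approach concrete, here is a minimal Python sketch. It is
not the spark.ml or scikit-learn code. The function names and the
representation of each tree's importances as a dict from feature index to raw
importance are assumptions made for illustration, and normalizing each tree
before averaging is one plausible choice rather than a confirmed detail of
either library.

```python
from typing import Dict, List


def normalize(importances: Dict[int, float]) -> Dict[int, float]:
    """Scale one tree's raw importances so they sum to 1 (hypothetical helper)."""
    total = sum(importances.values())
    if total == 0.0:
        # A tree with no splits contributes nothing; return it unchanged.
        return dict(importances)
    return {feature: value / total for feature, value in importances.items()}


def average_importances(per_tree: List[Dict[int, float]],
                        num_features: int) -> List[float]:
    """Average normalized per-tree importances across an ensemble."""
    totals = [0.0] * num_features
    for tree in per_tree:
        for feature, value in normalize(tree).items():
            totals[feature] += value

    if not per_tree:
        return totals

    averaged = [t / len(per_tree) for t in totals]

    # Renormalize so the ensemble-level vector sums to 1 when any tree split.
    s = sum(averaged)
    return [v / s for v in averaged] if s > 0 else averaged


# Two toy trees over three features, with made-up raw importances.
trees = [{0: 2.0, 1: 2.0}, {0: 1.0, 2: 3.0}]
result = average_importances(trees, num_features=3)
```

Every tree gets equal weight here, which is the open question for GBT: in a
boosted ensemble the trees are not interchangeable, so a plain average may not
be the right aggregation.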
> Feature Importance for GBT
> --------------------------
>
> Key: SPARK-11730
> URL: https://issues.apache.org/jira/browse/SPARK-11730
> Project: Spark
> Issue Type: New Feature
> Components: ML, MLlib
> Reporter: Brian Webb
>
> Random Forests have feature importance, but GBT do not. It would be great if
> we can add feature importance to GBT as well. Perhaps the code in Random
> Forests can be refactored to apply to both types of ensembles.
> See https://issues.apache.org/jira/browse/SPARK-5133