Github user namma commented on a diff in the pull request:
https://github.com/apache/spark/pull/7838#discussion_r36029766
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -1113,4 +1114,77 @@ private[ml] object RandomForest extends Logging {
}
}
+  /**
+   * Given a Random Forest model, compute the importance of each feature.
+   * This generalizes the idea of "Gini" importance to other losses,
+   * following the explanation of Gini importance from "Random Forests" documentation
+   * by Leo Breiman and Adele Cutler, and following the implementation from scikit-learn.
+   *
+   * This feature importance is calculated as follows:
+   *  - Average over trees:
+   *     - importance(feature j) = sum (over nodes which split on feature j) of the gain,
+   *       where gain is scaled by the number of instances passing through node
+   *     - Normalize importances for tree based on total number of training instances used
+   *       to build tree.
+   *  - Normalize feature importance vector to sum to 1.
+   *
+   * Note: This should not be used with Gradient-Boosted Trees. It only makes sense for
+   *       independently trained trees.
+   * Note: This is returned as a Map since models do not store the number of features.
+   *       That should be corrected in the future.
+   * @param trees  Unweighted forest of trees
+   * @return  Feature importance values. Returned as map from feature index to importance.
+   */
+  private[ml] def featureImportances(trees: Array[DecisionTreeModel]): Map[Int, Double] = {
+    val totalImportances = new OpenHashMap[Int, Double]()
+    trees.foreach { tree =>
+      // Aggregate feature importance vector for this tree
+      val importances = new OpenHashMap[Int, Double]()
+      computeFeatureImportance(tree.rootNode, importances)
+      // Normalize importance vector for this tree, and add it to total.
+      val treeNorm = tree.rootNode.impurityStats.count
--- End diff --
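
The quoted code calls computeFeatureImportance, which is not shown in this diff. As a sketch of what such a recursive helper might look like (the exact implementation in the PR may differ): at each internal node, the split's gain is scaled by the number of instances reaching that node and added to the splitting feature's running total.

    private def computeFeatureImportance(
        node: Node,
        importances: OpenHashMap[Int, Double]): Unit = node match {
      case n: InternalNode =>
        // Gain scaled by the number of instances passing through this node.
        val scaledGain = n.gain * n.impurityStats.count
        importances.changeValue(n.split.featureIndex, scaledGain, _ + scaledGain)
        computeFeatureImportance(n.leftChild, importances)
        computeFeatureImportance(n.rightChild, importances)
      case _: LeafNode =>
        // Leaves do not split, so they contribute nothing.
    }
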
I am not sure I am interpreting your comment correctly, but I believe the code is correct as written.
- The normalization that makes 'importances' sum to 1 (the one in your comment) is done at line 1154 below, on the final vector of feature importances.
- The normalization in this piece of code (starting at line 1144) normalizes the weight of the nodes in the tree by dividing the number of instances at each node by the total number of instances used to build the tree. The normalized importance scores are then summed up to produce the final vector of feature importances.
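
To make the distinction between the two normalizations concrete, here is a small self-contained Scala sketch with made-up numbers (plain Maps stand in for the OpenHashMap accumulation in the PR; the counts and gains are hypothetical):

    // Raw per-feature gain sums for two trees; tree A was built from 100
    // training instances, tree B from 50.
    val treeA = Map(0 -> 30.0, 1 -> 10.0)
    val treeB = Map(0 -> 5.0, 2 -> 20.0)

    // First normalization (line 1144): divide each tree's gains by the
    // number of instances used to build that tree.
    val perTree = Seq((treeA, 100.0), (treeB, 50.0)).map { case (gains, count) =>
      gains.map { case (feature, gain) => feature -> gain / count }
    }

    // Sum across trees, then apply the second normalization (line 1154):
    // rescale so the final importance vector sums to 1.
    val summed = perTree.flatten.groupBy(_._1).map { case (f, kvs) => f -> kvs.map(_._2).sum }
    val total = summed.values.sum
    val importances = summed.map { case (f, v) => f -> v / total }
    // importances == Map(0 -> 0.444..., 1 -> 0.111..., 2 -> 0.444...), summing to 1.

Note how the per-tree division keeps a tree built on many instances from dominating purely because its raw gain sums are larger, while the final division only fixes the overall scale of the vector.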