Github user yanboliang commented on a diff in the pull request:
https://github.com/apache/spark/pull/7838#discussion_r36040523
--- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
@@ -1113,4 +1114,77 @@ private[ml] object RandomForest extends Logging {
}
}
+  /**
+   * Given a Random Forest model, compute the importance of each feature.
+   * This generalizes the idea of "Gini" importance to other losses,
+   * following the explanation of Gini importance from "Random Forests" documentation
+   * by Leo Breiman and Adele Cutler, and following the implementation from scikit-learn.
+   *
+   * This feature importance is calculated as follows:
+   *  - Average over trees:
+   *     - importance(feature j) = sum (over nodes which split on feature j) of the gain,
+   *       where gain is scaled by the number of instances passing through node
+   *     - Normalize importances for tree based on total number of training instances used
+   *       to build tree.
+   *  - Normalize feature importance vector to sum to 1.
+   *
+   * Note: This should not be used with Gradient-Boosted Trees. It only makes sense for
+   *       independently trained trees.
+   * Note: This is returned as a Map since models do not store the number of features.
+   *       That should be corrected in the future.
+   * @param trees Unweighted forest of trees
+   * @return Feature importance values. Returned as map from feature index to importance.
+   */
+  private[ml] def featureImportances(trees: Array[DecisionTreeModel]): Map[Int, Double] = {
+    val totalImportances = new OpenHashMap[Int, Double]()
+    trees.foreach { tree =>
+      // Aggregate feature importance vector for this tree
+      val importances = new OpenHashMap[Int, Double]()
+      computeFeatureImportance(tree.rootNode, importances)
+      // Normalize importance vector for this tree, and add it to total.
+      val treeNorm = tree.rootNode.impurityStats.count
--- End diff --
@jkbradley Correct, I agree with you.
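
For readers following the thread, the calculation described in the quoted Scaladoc can be sketched roughly as below. This is a minimal, self-contained illustration and not the code from the pull request: the `Node`, `Leaf`, and `Internal` types and the `accumulate` helper are hypothetical stand-ins for Spark's tree node classes and `computeFeatureImportance`, and a plain mutable map is assumed in place of `OpenHashMap`.

```scala
import scala.collection.mutable

object FeatureImportanceSketch {
  // Hypothetical, simplified node types (assumption: not Spark's actual classes).
  // `count` is the number of training instances that reached the node.
  sealed trait Node { def count: Double }
  case class Leaf(count: Double) extends Node
  case class Internal(
      featureIndex: Int,
      gain: Double,
      count: Double,
      left: Node,
      right: Node) extends Node

  // For each feature, sum (gain * instance count) over the nodes that split on it.
  def accumulate(node: Node, acc: mutable.Map[Int, Double]): Unit = node match {
    case Internal(feature, gain, count, left, right) =>
      acc(feature) = acc.getOrElse(feature, 0.0) + gain * count
      accumulate(left, acc)
      accumulate(right, acc)
    case _: Leaf => ()
  }

  def featureImportances(roots: Seq[Node]): Map[Int, Double] = {
    val total = mutable.Map.empty[Int, Double]
    roots.foreach { root =>
      val perTree = mutable.Map.empty[Int, Double]
      accumulate(root, perTree)
      // Normalize by the number of instances used to build this tree.
      val treeNorm = root.count
      if (treeNorm != 0) {
        perTree.foreach { case (feature, importance) =>
          total(feature) = total.getOrElse(feature, 0.0) + importance / treeNorm
        }
      }
    }
    // Dividing by the number of trees is skipped here: the final
    // normalization to sum 1 cancels any constant factor.
    val sum = total.values.sum
    if (sum != 0) total.map { case (f, imp) => f -> imp / sum }.toMap
    else total.toMap
  }
}
```

The result is a map rather than a dense vector for the reason given in the Scaladoc note: without a stored feature count, only features that appear in at least one split can be reported.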