Github user namma commented on a diff in the pull request:

    https://github.com/apache/spark/pull/7838#discussion_r36029766
  
    --- Diff: 
mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala ---
    @@ -1113,4 +1114,77 @@ private[ml] object RandomForest extends Logging {
         }
       }
     
    +  /**
    +   * Given a Random Forest model, compute the importance of each feature.
    +   * This generalizes the idea of "Gini" importance to other losses,
    +   * following the explanation of Gini importance from "Random Forests" 
documentation
    +   * by Leo Breiman and Adele Cutler, and following the implementation 
from scikit-learn.
    +   *
    +   * This feature importance is calculated as follows:
    +   *  - Average over trees:
    +   *     - importance(feature j) = sum (over nodes which split on feature 
j) of the gain,
    +   *       where gain is scaled by the number of instances passing through 
node
    +   *     - Normalize importances for tree based on total number of 
training instances used
    +   *       to build tree.
    +   *  - Normalize feature importance vector to sum to 1.
    +   *
    +   * Note: This should not be used with Gradient-Boosted Trees.  It only 
makes sense for
    +   *       independently trained trees.
    +   * Note: This is returned as a Map since models do not store the number 
of features.
    +   *       That should be corrected in the future.
    +   * @param trees  Unweighted forest of trees
    +   * @return  Feature importance values.  Returned as map from feature 
index to importance.
    +   */
    +  private[ml] def featureImportances(trees: Array[DecisionTreeModel]): 
Map[Int, Double] = {
    +    val totalImportances = new OpenHashMap[Int, Double]()
    +    trees.foreach { tree =>
    +      // Aggregate feature importance vector for this tree
    +      val importances = new OpenHashMap[Int, Double]()
    +      computeFeatureImportance(tree.rootNode, importances)
    +      // Normalize importance vector for this tree, and add it to total.
    +      val treeNorm = tree.rootNode.impurityStats.count
    --- End diff --
    
    I am not really sure I'm interpreting your comment correctly, but I 
believe the code is written correctly.
    - The normalization that makes `importances` sum to 1 (the one in your 
comment) is done at line 1154 below, on the final vector of feature 
importances.
    - The normalization in this piece of code (starting at line 1144) 
normalizes the weight of each node in the tree by dividing the number of 
instances at the node by the total number of instances used to build the 
tree. The importance scores are then summed up to produce the final vector 
of feature importances.



