Github user asolimando commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20632#discussion_r168946478
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/Node.scala ---
    @@ -287,6 +292,41 @@ private[tree] class LearningNode(
         }
       }
     
    +  /**
    +   * Method testing whether a node is a leaf.
    +   * @return true iff a node is a leaf.
    +   */
    +  private def isLeafNode(): Boolean = leftChild.isEmpty && rightChild.isEmpty
    +
    +  /** True iff the node should be a leaf. */
    +  private lazy val shouldBeLeaf: Boolean = leafPredictions.size == 1
    +
    +  /**
    +   * Returns the set of (leaf) predictions appearing in the subtree rooted at the considered node.
    +   * @return the set of (leaf) predictions appearing in the subtree rooted at the given node.
    +   */
    +  private def leafPredictions: Set[Double] = {
    --- End diff --
    
    For the construction we need the distinct predictions, as "redundancy" is 
    non-monotonic: from being true on subtrees, it can become false. The left and 
    right subtrees of a node can each be redundant (left predicts only 0, right 
    only 1) while the node itself is not (it can predict 0 or 1, depending on the 
    split). This can be ascertained only by comparing the prediction values; you 
    cannot derive it from the fact that the two subtrees are redundant alone, as 
    they can be redundant w.r.t. different values.
    
    Consider this example:
    ```
    1
    --1.1
    ----1.1.1 (pred 0)
    ----1.1.2 (pred 0)
    --1.2
    ----1.2.1 (pred 1)
    ----1.2.2 (pred 1)
    ```
    
    This can be pruned to:
    ```
    1
    --1.1 (pred 0)
    --1.2 (pred 1)
    ```
    
    Here 1.1 and 1.2 are both redundant, but 1 is not (and to verify this, you 
    need their prediction values).
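
    To make the idea concrete, here is a minimal sketch of this pruning rule. 
    `Node`, `Leaf`, `Internal`, `leafPredictions` and `prune` are hypothetical 
    simplified stand-ins, not Spark's actual `LearningNode` API:
    
    ```scala
    // Hypothetical simplified tree, for illustration only.
    sealed trait Node
    case class Leaf(prediction: Double) extends Node
    case class Internal(left: Node, right: Node) extends Node
    
    // Distinct predictions appearing at the leaves of the subtree.
    def leafPredictions(n: Node): Set[Double] = n match {
      case Leaf(p)        => Set(p)
      case Internal(l, r) => leafPredictions(l) ++ leafPredictions(r)
    }
    
    // A subtree can be collapsed to a leaf iff all its leaves agree,
    // i.e. the set of distinct predictions is a singleton.
    def prune(n: Node): Node = n match {
      case leaf: Leaf => leaf
      case Internal(l, r) =>
        val preds = leafPredictions(n)
        if (preds.size == 1) Leaf(preds.head)
        else Internal(prune(l), prune(r))
    }
    ```
    
    On the example tree above, the root sees predictions {0, 1} and is kept, 
    while 1.1 and 1.2 each see a singleton set and collapse to leaves. In the PR, 
    the per-node prediction set would be memoised (e.g. as a `lazy val`) so each 
    subtree is explored only once.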
    
    It is a different story if we go for storing "isRedundant" (the 
    shouldBeLeaf of the original PR), which can instead be safely used to decide 
    whether to prune.
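
    A sketch of this second option, again with hypothetical simplified types 
    (here `SNode`/`SLeaf`/`SInternal` to keep the example self-contained), to 
    show why the boolean alone is not enough and a representative prediction 
    must travel with it when combining siblings:
    
    ```scala
    // Hypothetical simplified tree, for illustration only.
    sealed trait SNode
    case class SLeaf(prediction: Double) extends SNode
    case class SInternal(left: SNode, right: SNode) extends SNode
    
    // Bottom-up: a node is redundant iff both children are redundant
    // AND they agree on the prediction value. Returns the flag together
    // with one representative prediction of the subtree, since two
    // redundant siblings may be redundant w.r.t. different values.
    def redundant(t: SNode): (Boolean, Double) = t match {
      case SLeaf(p) => (true, p)
      case SInternal(l, r) =>
        val (lRed, lPred) = redundant(l)
        val (rRed, rPred) = redundant(r)
        (lRed && rRed && lPred == rPred, lPred)
    }
    ```
    
    Once computed bottom-up, the boolean can be read off directly to decide 
    pruning, without recomparing prediction sets.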
    
    As you pointed out, this extra information is stored only in LearningNode, 
    so I think it doesn't make a significant difference whether we "store" the 
    set of predictions or shouldBeLeaf: these objects will be disposed of anyway 
    at the end of the translation, while what matters is to save computation time 
    and avoid multiple explorations (as you originally suggested).
    
    I would summarise it as follows, we either: 
    1) store the set of predictions, or
    2) store shouldBeLeaf
    
    (1) requires more memory but doesn't introduce extra variables (less code to 
    understand); (2) has the dual pros and cons.
    
    Does it make sense to you? Which option would you suggest?


---
