Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20632#discussion_r168936292
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/tree/Node.scala ---
    @@ -287,6 +292,41 @@ private[tree] class LearningNode(
         }
       }
     
    +  /**
    +   * Method testing whether a node is a leaf.
    +   * @return true iff a node is a leaf.
    +   */
    +  private def isLeafNode(): Boolean = leftChild.isEmpty && 
rightChild.isEmpty
    +
    +  /** True iff the node should be a leaf. */
    +  private lazy val shouldBeLeaf: Boolean = leafPredictions.size == 1
    +
    +  /**
    +   * Returns the set of (leaf) predictions appearing in the subtree rooted 
at the considered node.
    +   * @return the set of (leaf) predictions appearing in the subtree rooted 
at the given node.
    +   */
    +  private def leafPredictions: Set[Double] = {
    --- End diff --
    
    Yeah, and short-circuiting involves an `if` statement anyway, so maybe no 
value in my previous suggestion about `foreach`. 
    
    Now this saves the whole `Set[Double]` of predictions when it's never used 
again. It's probably worth not hogging that memory. The method here doesn't 
actually need to form the set of all predictions, just decide whether there's 
more than 1. For example, in the case that there are no distinct values after 
evaluating children, no point in adding the leaf's prediction to it just to 
decide it's got 1 unique value.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to