Github user asolimando commented on the issue:
https://github.com/apache/spark/pull/20632
Hello Sean,
here is my understanding of the problem and the main intuition of the
proposed solution:
We want to have a tree such that it does not contain any redundant subtree.
A subtree is redundant iff it is not composed by a leaf and all the leaves
belonging to it agree on the prediction (the leafPredictions has at least two
distinct values).
Given that for any subtree there is exactly one root node, we "store" this
information in its root (learning) node.
A redundant node can be replaced by a single leaf, predicting the single
prediction value reachable via that subtree.
So, there are basically two cases in our top-down exploration:
1) The subtree is redundant -> the whole subtree is explored (linearly) to
collect the predictions, the node is translated into a LeafNode, and thus the
toNode method is never called on its children, so no extra exploration
2) The subtree is *not* redundant -> the whole subtree is explored
(linearly) to collect the predictions, the node is translated into an
InternalNode, and toNode recursively called on its children. However, at this
point, the lazy value shouldBeLeaf (private lazy val shouldBeLeaf: Boolean) has
been already set in the previous call, so there won't be any further
exploration. Each further step down the recursion path will be able to "reason"
locally on stored shouldBeLeaf values.
Indeed, I was about to switch from the first top-down version to a
bottom-up one for avoiding the repeated explorations, but then I realized that
"lazy" could achieve the same result, while keeping the top-down exploration
that is, in my opinion, more straightforward to read.
I have updated the PR according to your suggestions, I haven't squashed the
commits yet in case there will be further comments.
I will reply to your comments below so I hope I won't miss anything.
Thanks for your help.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]