Hello, a small recap for anyone interested. There was already a ticket covering this case, which I had failed to find when I first checked, so mine has been correctly marked as a duplicate of it: https://issues.apache.org/jira/browse/SPARK-3159
I have created a PR for this that you can check here if interested: https://github.com/apache/spark/pull/20632

On 13 February 2018 at 14:39, Alessandro Solimando <alessandro.solima...@gmail.com> wrote:

> Thanks for your feedback Sean, I agree with you.
>
> I have logged a JIRA case (https://issues.apache.org/jira/browse/SPARK-23409); I will take a look at the code in more detail and see if I can come up with a PR to handle this.
>
> On 13 February 2018 at 12:00, Sean Owen <sro...@gmail.com> wrote:
>
>> I think the simple pruning you have in mind was just never implemented.
>>
>> That sort of pruning wouldn't help much if the nodes maintained a distribution over classes, as those are rarely identical; but they just maintain a single class prediction, and after training I see no value in keeping those nodes. Whatever impurity gain the split achieved on the training data is 'lost' when the prediction is collapsed to a single class anyway.
>>
>> Whether it's easy to implement in the code I don't know, but it's straightforward conceptually.
>>
>> On Tue, Feb 13, 2018 at 4:21 AM Alessandro Solimando <alessandro.solima...@gmail.com> wrote:
>>
>>> Hello Nick,
>>> thanks for the pointer, that's interesting.
>>>
>>> However, there seems to be a major difference from what I was discussing.
>>>
>>> That JIRA issue relates to overfitting and considerations on information gain, while what I propose is a much simpler "syntactic" pruning.
>>>
>>> Consider a fragment of the example from my first message (quoted below), the leftmost subtree in particular:
>>>
>>>> If (feature 1 <= 0.5)
>>>>  If (feature 2 <= 0.5)
>>>>   If (feature 0 <= 0.5)
>>>>    Predict: 0.0
>>>>   Else (feature 0 > 0.5)
>>>>    Predict: 0.0
>>>>  Else (feature 2 > 0.5)
>>>>   If (feature 0 <= 0.5)
>>>>    Predict: 0.0
>>>>   Else (feature 0 > 0.5)
>>>>    Predict: 0.0
>>>
>>> which corresponds to the following "objects":
>>>
>>>> -InternalNode(prediction = 0.0, impurity = 0.48753462603878117, split = org.apache.spark.ml.tree.ContinuousSplit@fdf00000)
>>>> --InternalNode(prediction = 0.0, impurity = 0.345679012345679, split = org.apache.spark.ml.tree.ContinuousSplit@ffe00000)
>>>> ---InternalNode(prediction = 0.0, impurity = 0.4444444444444445, split = org.apache.spark.ml.tree.ContinuousSplit@3fe00000)
>>>> ----LeafNode(prediction = 0.0, impurity = -1.0)
>>>> ----LeafNode(prediction = 0.0, impurity = 0.0)
>>>> ---InternalNode(prediction = 0.0, impurity = 0.2777777777777777, split = org.apache.spark.ml.tree.ContinuousSplit@3fe00000)
>>>> ----LeafNode(prediction = 0.0, impurity = 0.0)
>>>> ----LeafNode(prediction = 0.0, impurity = -1.0)
>>>
>>> For sure a more comprehensive node-splitting policy based on impurity might prevent this situation (by splitting node "ffe00000" you have an impurity gain on one child and a loss on the other), but independently of this, once the tree is built I would cut the redundant subtree and obtain the following:
>>>
>>>> -InternalNode(prediction = 0.0, impurity = 0.48753462603878117, split = org.apache.spark.ml.tree.ContinuousSplit@fdf00000)
>>>> --LeafNode(prediction = 0.0, impurity = ...)
>>>
>>> I cannot say whether this is relevant for all the tree ensemble methods, but it surely is for RF, even more than for DT, as the leverage effect will be even higher (and the code generating them is the same: DT calls RF with numTrees = 1, as far as I can see).
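>>>
>>> To make the idea concrete, here is a minimal sketch of the collapse I have in mind. It is written as if it lived inside the org.apache.spark.ml.tree package, since the Node constructors are private[ml] and not reachable from user code, and it passes null for the private impurityStats argument for brevity; so take it as an illustration of the traversal, not as the actual PR code:
>>>
>>>> import org.apache.spark.ml.tree.{InternalNode, LeafNode, Node}
>>>>
>>>> // Bottom-up pass: prune the children first, then collapse the current
>>>> // split if both pruned children are leaves with the same prediction.
>>>> def prune(node: Node): Node = node match {
>>>>   case n: InternalNode =>
>>>>     (prune(n.leftChild), prune(n.rightChild)) match {
>>>>       case (l: LeafNode, r: LeafNode) if l.prediction == r.prediction =>
>>>>         // The split is redundant: the whole subtree becomes one leaf.
>>>>         new LeafNode(l.prediction, n.impurity, null /* impurityStats elided */)
>>>>       case (l, r) =>
>>>>         new InternalNode(n.prediction, n.impurity, n.gain, l, r, n.split, null)
>>>>     }
>>>>   case leaf => leaf
>>>> }
>>>
>>> On the fragment above, the two innermost splits collapse into leaves first, and their parent then collapses as well, so the redundant subtree reduces to the single LeafNode shown.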
>>>
>>> Being an optimization aimed at reducing the model's memory footprint and invocation time, it is independent of any consideration about the statistical amortization of overfitting, which your reply seems to imply.
>>>
>>> Am I missing something?
>>>
>>> Best regards,
>>> Alessandro
>>>
>>> On 13 February 2018 at 10:57, Nick Pentreath <nick.pentre...@gmail.com> wrote:
>>>
>>>> There is a long-outstanding JIRA issue about it: https://issues.apache.org/jira/browse/SPARK-3155.
>>>>
>>>> It is probably still a useful feature to have for trees, but the priority is not that high, since it may not be that useful for the tree ensemble models.
>>>>
>>>> On Tue, 13 Feb 2018 at 11:52 Alessandro Solimando <alessandro.solima...@gmail.com> wrote:
>>>>
>>>>> Hello community,
>>>>> I have recently manually inspected some decision trees computed with Spark (2.2.1, but the behaviour is the same with the latest code on the repo).
>>>>>
>>>>> I have observed that the trees are always complete, even if an entire subtree leads to the same prediction in all of its leaves.
>>>>>
>>>>> In such a case, the root of the subtree, instead of being an InternalNode, could simply be a LeafNode with the (shared) prediction.
>>>>>
>>>>> I know that decision trees computed by scikit-learn exhibit the same behaviour, and I understand that this arises by construction, because the redundancy only becomes apparent at the end.
>>>>>
>>>>> So my question is: why is this "post-pruning" missing?
>>>>>
>>>>> Three hypotheses:
>>>>>
>>>>> 1) It is not suitable (for a reason I fail to see)
>>>>> 2) Such an addition to the code is considered not worth it (in terms of code complexity, maybe)
>>>>> 3) It has been overlooked, but could be a favourable addition
>>>>>
>>>>> For clarity, I have managed to isolate a small case that reproduces this, in what follows.
>>>>>
>>>>> This is the dataset:
>>>>>
>>>>>> +-----+-------------+
>>>>>> |label|features     |
>>>>>> +-----+-------------+
>>>>>> |1.0  |[1.0,0.0,1.0]|
>>>>>> |1.0  |[0.0,1.0,0.0]|
>>>>>> |1.0  |[1.0,1.0,0.0]|
>>>>>> |0.0  |[0.0,0.0,0.0]|
>>>>>> |1.0  |[1.0,1.0,0.0]|
>>>>>> |0.0  |[0.0,1.0,1.0]|
>>>>>> |1.0  |[0.0,0.0,0.0]|
>>>>>> |0.0  |[0.0,1.0,1.0]|
>>>>>> |1.0  |[0.0,1.0,1.0]|
>>>>>> |0.0  |[1.0,0.0,0.0]|
>>>>>> |0.0  |[1.0,0.0,1.0]|
>>>>>> |1.0  |[0.0,1.0,1.0]|
>>>>>> |0.0  |[0.0,0.0,1.0]|
>>>>>> |0.0  |[1.0,0.0,1.0]|
>>>>>> |0.0  |[0.0,0.0,1.0]|
>>>>>> |0.0  |[1.0,1.0,1.0]|
>>>>>> |0.0  |[1.0,1.0,0.0]|
>>>>>> |1.0  |[1.0,1.0,1.0]|
>>>>>> |0.0  |[1.0,0.0,1.0]|
>>>>>> +-----+-------------+
>>>>>
>>>>> which generates the following model:
>>>>>
>>>>>> DecisionTreeClassificationModel (uid=dtc_e794a5a3aa9e) of depth 3 with 15 nodes
>>>>>>  If (feature 1 <= 0.5)
>>>>>>   If (feature 2 <= 0.5)
>>>>>>    If (feature 0 <= 0.5)
>>>>>>     Predict: 0.0
>>>>>>    Else (feature 0 > 0.5)
>>>>>>     Predict: 0.0
>>>>>>   Else (feature 2 > 0.5)
>>>>>>    If (feature 0 <= 0.5)
>>>>>>     Predict: 0.0
>>>>>>    Else (feature 0 > 0.5)
>>>>>>     Predict: 0.0
>>>>>>  Else (feature 1 > 0.5)
>>>>>>   If (feature 2 <= 0.5)
>>>>>>    If (feature 0 <= 0.5)
>>>>>>     Predict: 1.0
>>>>>>    Else (feature 0 > 0.5)
>>>>>>     Predict: 1.0
>>>>>>   Else (feature 2 > 0.5)
>>>>>>    If (feature 0 <= 0.5)
>>>>>>     Predict: 0.0
>>>>>>    Else (feature 0 > 0.5)
>>>>>>     Predict: 0.0
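>>>>>
>>>>> The model above can be reproduced with a snippet along these lines (a sketch: it assumes an existing SparkSession named spark, and the exact uid and node ordering may of course differ between runs):
>>>>>
>>>>>> import org.apache.spark.ml.classification.DecisionTreeClassifier
>>>>>> import org.apache.spark.ml.linalg.Vectors
>>>>>> import spark.implicits._
>>>>>>
>>>>>> // The 19 (label, features) rows from the table above.
>>>>>> val data = Seq(
>>>>>>   (1.0, Vectors.dense(1.0, 0.0, 1.0)),
>>>>>>   (1.0, Vectors.dense(0.0, 1.0, 0.0)),
>>>>>>   // ... remaining rows exactly as listed above ...
>>>>>>   (0.0, Vectors.dense(1.0, 0.0, 1.0))
>>>>>> ).toDF("label", "features")
>>>>>>
>>>>>> // Defaults already use the "label" and "features" columns.
>>>>>> val model = new DecisionTreeClassifier().fit(data)
>>>>>> println(model.toDebugString)  // prints the complete, unpruned tree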
>>>>>
>>>>> As you can see, the following model would be equivalent, but smaller and cheaper to invoke:
>>>>>
>>>>>> DecisionTreeClassificationModel (uid=dtc_e794a5a3aa9e) of depth 2 with 5 nodes
>>>>>>  If (feature 1 <= 0.5)
>>>>>>   Predict: 0.0
>>>>>>  Else (feature 1 > 0.5)
>>>>>>   If (feature 2 <= 0.5)
>>>>>>    Predict: 1.0
>>>>>>   Else (feature 2 > 0.5)
>>>>>>    Predict: 0.0
>>>>>
>>>>> This happens pretty often in real cases, and despite the small gain for a single model invocation, the saving of the "optimized" version can become non-negligible when the number of calls is massive, as one can expect in a Big Data context.
>>>>>
>>>>> I would appreciate your opinion on this matter (whether it is relevant for a PR or not, pros/cons, etc.).
>>>>>
>>>>> Best regards,
>>>>> Alessandro
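>>>>>
>>>>> P.S. To check how widespread this is on a real model, the redundant subtrees can be counted using the public API alone (a small sketch; no private constructors are needed, and model is the fitted DecisionTreeClassificationModel from the snippet above):
>>>>>
>>>>>> import org.apache.spark.ml.tree.{InternalNode, LeafNode, Node}
>>>>>>
>>>>>> // Distinct predictions among the leaves below a node.
>>>>>> def leafPredictions(node: Node): Set[Double] = node match {
>>>>>>   case n: InternalNode =>
>>>>>>     leafPredictions(n.leftChild) ++ leafPredictions(n.rightChild)
>>>>>>   case l: LeafNode => Set(l.prediction)
>>>>>> }
>>>>>>
>>>>>> // Internal nodes whose whole subtree predicts a single class,
>>>>>> // i.e. roots of subtrees that the proposed pruning would collapse.
>>>>>> def redundantSplits(node: Node): Int = node match {
>>>>>>   case n: InternalNode =>
>>>>>>     (if (leafPredictions(n).size == 1) 1 else 0) +
>>>>>>       redundantSplits(n.leftChild) + redundantSplits(n.rightChild)
>>>>>>   case _ => 0
>>>>>> }
>>>>>>
>>>>>> println(redundantSplits(model.rootNode))  // 5 for the tree above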