Hello,
a small recap for those who are interested.

There was already a ticket covering this case, which I had failed to find
when I checked. As a result, the other one has been correctly marked as a
duplicate:
https://issues.apache.org/jira/browse/SPARK-3159

I have created a PR for this; you can check it here if interested:
https://github.com/apache/spark/pull/20632

On 13 February 2018 at 14:39, Alessandro Solimando <
alessandro.solima...@gmail.com> wrote:

> Thanks for your feedback, Sean; I agree with you.
>
> I have logged a JIRA case (https://issues.apache.org/jira/browse/SPARK-23409);
> I will take a look at the code in more detail and see if I can come up
> with a PR to handle this.
>
> On 13 February 2018 at 12:00, Sean Owen <sro...@gmail.com> wrote:
>
>> I think the simple pruning you have in mind was just never implemented.
>>
>> That sort of pruning wouldn't help much if the nodes maintained a
>> distribution over classes, as those are rarely identical; but they
>> maintain only a single class prediction. After training, I see no value
>> in keeping those nodes. Whatever impurity gain the split achieved on the
>> training data is 'lost' when the prediction is collapsed to a single
>> class anyway.
>>
>> Whether it's easy to implement in the code I don't know, but it's
>> straightforward conceptually.
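>>
>> For illustration, a bottom-up pass over the tree could do it. Here is a
>> minimal sketch, using a toy tree type rather than Spark's own classes
>> (whose constructors are private[ml]), so it shows the idea, not a
>> drop-in patch:
>>
>>   // Toy stand-in for org.apache.spark.ml.tree.Node.
>>   sealed trait SimpleNode { def prediction: Double }
>>   case class SimpleLeaf(prediction: Double) extends SimpleNode
>>   case class SimpleInternal(prediction: Double,
>>                             left: SimpleNode,
>>                             right: SimpleNode) extends SimpleNode
>>
>>   // Post-pruning, bottom-up: once both children are pruned, a split
>>   // whose children are leaves with the same prediction carries no
>>   // information, so it collapses into a single leaf.
>>   def prune(node: SimpleNode): SimpleNode = node match {
>>     case SimpleInternal(p, l, r) =>
>>       (prune(l), prune(r)) match {
>>         case (SimpleLeaf(lp), SimpleLeaf(rp)) if lp == rp => SimpleLeaf(lp)
>>         case (pl, pr) => SimpleInternal(p, pl, pr)
>>       }
>>     case leaf => leaf
>>   }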
>>
>> On Tue, Feb 13, 2018 at 4:21 AM Alessandro Solimando <
>> alessandro.solima...@gmail.com> wrote:
>>
>>> Hello Nick,
>>> thanks for the pointer, that's interesting.
>>>
>>> However, there seems to be a major difference from what I was discussing.
>>>
>>> The JIRA issue relates to overfitting and considerations on information
>>> gain, while what I propose is a much simpler "syntactic" pruning.
>>>
>>> Consider a fragment of the example above, the leftmost subtree in
>>> particular:
>>>
>>>> If (feature 1 <= 0.5)
>>>>    If (feature 2 <= 0.5)
>>>>     If (feature 0 <= 0.5)
>>>>      Predict: 0.0
>>>>     Else (feature 0 > 0.5)
>>>>      Predict: 0.0
>>>>    Else (feature 2 > 0.5)
>>>>     If (feature 0 <= 0.5)
>>>>      Predict: 0.0
>>>>     Else (feature 0 > 0.5)
>>>>      Predict: 0.0
>>>
>>>
>>> Which corresponds to the following "objects":
>>>
>>>> -InternalNode(prediction = 0.0, impurity = 0.48753462603878117, split =
>>>> org.apache.spark.ml.tree.ContinuousSplit@fdf00000)
>>>> --InternalNode(prediction = 0.0, impurity = 0.345679012345679, split =
>>>> org.apache.spark.ml.tree.ContinuousSplit@ffe00000)
>>>> ---InternalNode(prediction = 0.0, impurity = 0.4444444444444445, split
>>>> = org.apache.spark.ml.tree.ContinuousSplit@3fe00000)
>>>> ----LeafNode(prediction = 0.0, impurity = -1.0)
>>>> ----LeafNode(prediction = 0.0, impurity = 0.0)
>>>> ---InternalNode(prediction = 0.0, impurity = 0.2777777777777777, split
>>>> = org.apache.spark.ml.tree.ContinuousSplit@3fe00000)
>>>> ----LeafNode(prediction = 0.0, impurity = 0.0)
>>>> ----LeafNode(prediction = 0.0, impurity = -1.0)
>>>
>>>
>>> Certainly, a more comprehensive node-splitting policy based on impurity
>>> might prevent this situation (by splitting node "ffe00000" you have an
>>> impurity gain on one child, and a loss on the other), but, independently
>>> of this, once the tree is built, I would cut the redundant subtree and
>>> obtain the following:
>>>
>>>> -InternalNode(prediction = 0.0, impurity = 0.48753462603878117, split =
>>>> org.apache.spark.ml.tree.ContinuousSplit@fdf00000)
>>>> --LeafNode(prediction = 0.0, impurity = ...)
>>>
>>>
>>> I cannot say whether this is relevant for all the tree ensemble methods,
>>> but it certainly is for RF, even more than for DT, as the leverage
>>> effect will be even higher (and the code generating them is the same:
>>> DT calls RF with numTrees = 1, from what I can see).
>>>
>>> Being an optimization aimed at reducing the model's memory footprint
>>> and invocation time, it is independent of any statistical consideration
>>> about overfitting, contrary to what your reply seems to imply.
>>>
>>> Am I missing something?
>>>
>>> Best regards,
>>> Alessandro
>>>
>>>
>>>
>>> On 13 February 2018 at 10:57, Nick Pentreath <nick.pentre...@gmail.com>
>>> wrote:
>>>
>>>> There is a long-outstanding JIRA issue about it:
>>>> https://issues.apache.org/jira/browse/SPARK-3155.
>>>>
>>>> It is probably still a useful feature to have for trees, but the
>>>> priority is not that high, since it may not be that useful for the
>>>> tree ensemble models.
>>>>
>>>>
>>>> On Tue, 13 Feb 2018 at 11:52 Alessandro Solimando <
>>>> alessandro.solima...@gmail.com> wrote:
>>>>
>>>>> Hello community,
>>>>> I have recently manually inspected some decision trees computed with
>>>>> Spark (2.2.1, but the behavior is the same with the latest code on the
>>>>> repo).
>>>>>
>>>>> I have observed that the trees are always complete, even if an entire
>>>>> subtree leads to the same prediction in its different leaves.
>>>>>
>>>>> In such a case, the root of the subtree, instead of being an
>>>>> InternalNode, could simply be a LeafNode with the (shared) prediction.
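>>>>>
>>>>> As a rough sketch (assuming a fitted DecisionTreeClassificationModel
>>>>> called model), such collapsible subtrees can be detected by walking
>>>>> the public node API:
>>>>>
>>>>>   import org.apache.spark.ml.tree.{InternalNode, LeafNode, Node}
>>>>>
>>>>>   // All distinct leaf predictions found in a subtree.
>>>>>   def leafPredictions(n: Node): Set[Double] = n match {
>>>>>     case l: LeafNode     => Set(l.prediction)
>>>>>     case i: InternalNode =>
>>>>>       leafPredictions(i.leftChild) ++ leafPredictions(i.rightChild)
>>>>>   }
>>>>>
>>>>>   // An InternalNode whose leaves all agree could be a LeafNode.
>>>>>   def collapsible(n: Node): Boolean =
>>>>>     n.isInstanceOf[InternalNode] && leafPredictions(n).size == 1
>>>>>
>>>>>   collapsible(model.rootNode)  // check the root, or recurse deeper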
>>>>>
>>>>> I know that decision trees computed by scikit-learn exhibit the same
>>>>> behavior; I understand that this arises by construction, because the
>>>>> redundancy only becomes apparent at the end.
>>>>>
>>>>> So my question is, why is this "post-pruning" missing?
>>>>>
>>>>> Three hypotheses:
>>>>>
>>>>> 1) It is not suitable (for a reason I fail to see)
>>>>> 2) Such an addition to the code is considered not worth it (in terms
>>>>> of code complexity, maybe)
>>>>> 3) It has been overlooked, but could be a favorable addition
>>>>>
>>>>> For clarity, I have isolated a small case that reproduces this, shown
>>>>> in what follows.
>>>>>
>>>>> This is the dataset:
>>>>>
>>>>>> +-----+-------------+
>>>>>> |label|features     |
>>>>>> +-----+-------------+
>>>>>> |1.0  |[1.0,0.0,1.0]|
>>>>>> |1.0  |[0.0,1.0,0.0]|
>>>>>> |1.0  |[1.0,1.0,0.0]|
>>>>>> |0.0  |[0.0,0.0,0.0]|
>>>>>> |1.0  |[1.0,1.0,0.0]|
>>>>>> |0.0  |[0.0,1.0,1.0]|
>>>>>> |1.0  |[0.0,0.0,0.0]|
>>>>>> |0.0  |[0.0,1.0,1.0]|
>>>>>> |1.0  |[0.0,1.0,1.0]|
>>>>>> |0.0  |[1.0,0.0,0.0]|
>>>>>> |0.0  |[1.0,0.0,1.0]|
>>>>>> |1.0  |[0.0,1.0,1.0]|
>>>>>> |0.0  |[0.0,0.0,1.0]|
>>>>>> |0.0  |[1.0,0.0,1.0]|
>>>>>> |0.0  |[0.0,0.0,1.0]|
>>>>>> |0.0  |[1.0,1.0,1.0]|
>>>>>> |0.0  |[1.0,1.0,0.0]|
>>>>>> |1.0  |[1.0,1.0,1.0]|
>>>>>> |0.0  |[1.0,0.0,1.0]|
>>>>>> +-----+-------------+
>>>>>
>>>>>
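>>>>> A sketch to reproduce this (assuming a SparkSession named spark, as
>>>>> in spark-shell; all classifier parameters left at their defaults):
>>>>>
>>>>>   import org.apache.spark.ml.classification.DecisionTreeClassifier
>>>>>   import org.apache.spark.ml.linalg.Vectors
>>>>>   import spark.implicits._
>>>>>
>>>>>   // The dataset above, as (label, features) pairs.
>>>>>   val data = Seq(
>>>>>     (1.0, Vectors.dense(1.0, 0.0, 1.0)),
>>>>>     (1.0, Vectors.dense(0.0, 1.0, 0.0)),
>>>>>     // ... the remaining rows from the table above ...
>>>>>     (0.0, Vectors.dense(1.0, 0.0, 1.0))
>>>>>   ).toDF("label", "features")
>>>>>
>>>>>   val model = new DecisionTreeClassifier()
>>>>>     .setLabelCol("label")
>>>>>     .setFeaturesCol("features")
>>>>>     .fit(data)
>>>>>
>>>>>   println(model.toDebugString)
>>>>>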
>>>>> Which generates the following model:
>>>>>
>>>>>> DecisionTreeClassificationModel (uid=dtc_e794a5a3aa9e) of depth 3
>>>>>> with 15 nodes
>>>>>>   If (feature 1 <= 0.5)
>>>>>>    If (feature 2 <= 0.5)
>>>>>>     If (feature 0 <= 0.5)
>>>>>>      Predict: 0.0
>>>>>>     Else (feature 0 > 0.5)
>>>>>>      Predict: 0.0
>>>>>>    Else (feature 2 > 0.5)
>>>>>>     If (feature 0 <= 0.5)
>>>>>>      Predict: 0.0
>>>>>>     Else (feature 0 > 0.5)
>>>>>>      Predict: 0.0
>>>>>>   Else (feature 1 > 0.5)
>>>>>>    If (feature 2 <= 0.5)
>>>>>>     If (feature 0 <= 0.5)
>>>>>>      Predict: 1.0
>>>>>>     Else (feature 0 > 0.5)
>>>>>>      Predict: 1.0
>>>>>>    Else (feature 2 > 0.5)
>>>>>>     If (feature 0 <= 0.5)
>>>>>>      Predict: 0.0
>>>>>>     Else (feature 0 > 0.5)
>>>>>>      Predict: 0.0
>>>>>
>>>>>
>>>>> As you can see, the following model would be equivalent, but smaller
>>>>> and faster to evaluate:
>>>>>
>>>>>> DecisionTreeClassificationModel (uid=dtc_e794a5a3aa9e) of depth 2
>>>>>> with 5 nodes
>>>>>>   If (feature 1 <= 0.5)
>>>>>>    Predict: 0.0
>>>>>>   Else (feature 1 > 0.5)
>>>>>>    If (feature 2 <= 0.5)
>>>>>>     Predict: 1.0
>>>>>>    Else (feature 2 > 0.5)
>>>>>>     Predict: 0.0
>>>>>
>>>>>
>>>>> This happens pretty often in real cases, and despite the small gain in
>>>>> a single model invocation for the "optimized" version, it can become
>>>>> non-negligible when the number of calls is massive, as one can expect
>>>>> in a Big Data context.
>>>>>
>>>>> I would appreciate your opinion on this matter (whether it is worth a
>>>>> PR or not, pros/cons, etc.).
>>>>>
>>>>> Best regards,
>>>>> Alessandro
>>>>>
>>>>
>>>
>
