GitHub user chouqin opened a pull request:

    https://github.com/apache/spark/pull/2708

    [SPARK-3158][MLLIB]Avoid 1 extra aggregation for DecisionTree training

    Currently, the implementation does one unnecessary aggregation step. The 
aggregation step for level L (to choose splits) gives enough information to set 
the predictions of any leaf nodes at level L+1. We can use that info and skip 
the aggregation step for the last level of the tree (which only has leaf nodes).
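
To illustrate why the level-L statistics are sufficient, here is a minimal Python sketch (not the Spark code). It assumes the aggregation for a node yields label counts per feature bin; the names `child_stats`, `predict`, and `gini`, and the sample counts, are hypothetical. Once a split is chosen, summing the bins on each side gives each child's label counts, from which its prediction and impurity follow without another pass over the data.

```python
from collections import Counter

def gini(counts):
    """Gini impurity of a label-count histogram."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

def predict(counts):
    """Majority label and its empirical probability (ties go to the smaller label)."""
    total = sum(counts.values())
    label, count = max(counts.items(), key=lambda kv: (kv[1], -kv[0]))
    return label, count / total

def child_stats(bin_counts, split_bin):
    """Sum the per-bin label counts on each side of the chosen split."""
    left, right = Counter(), Counter()
    for b, counts in enumerate(bin_counts):
        (left if b <= split_bin else right).update(counts)
    return left, right

# Hypothetical per-bin label counts aggregated for one node at level L.
bin_counts = [Counter({0: 4, 1: 0}), Counter({0: 3, 1: 1}), Counter({0: 0, 1: 4})]

# Splitting after bin 1 fixes the children's statistics at level L+1.
left, right = child_stats(bin_counts, split_bin=1)
left_predict, right_predict = predict(left), predict(right)
left_impurity, right_impurity = gini(left), gini(right)
```

If both children are leaves, `left_predict` and `right_predict` are everything the tree needs from them, so no aggregation over level L+1 is required.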
    
    ### Implementation Details
    
    Each node now has an `impurity` field, and `predict` is changed from type 
`Double` to type `Predict` (which can be used to compute prediction probabilities 
in the future). When computing the best split for each node, we also compute the 
impurity and predict for its child nodes, which are used to construct the newly 
allocated child nodes. So at level L, impurity and predict are already set for 
the nodes at level L+1.
    If level L+1 is the last level, then we can avoid aggregation. What's more, 
calculation of parent impurity in 
    
    
    The top node of each tree needs to be treated differently because we have to 
compute impurity and predict for it first. In `binsToBestSplit`, if the current 
node is a top node (level == 0), we calculate impurity and predict first. 
    After finding the best split, the top node's predict and impurity are set to 
the calculated values. The impurity and predict of non-top nodes are already 
calculated and don't need to be recalculated. I considered adding an 
initialization step to set the top nodes' impurity and predict so that we could 
treat all nodes in the same way, but this would require a lot of duplicated 
code (all the code for the seq operation (BinSeqOp) would need to be duplicated), 
so I chose the current approach.
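
The control flow described above can be sketched as a simplified, single-machine Python example. This is not the MLlib implementation: `Node`, `aggregate`, `best_split`, and `train` are hypothetical names, the data has a single numeric feature, and `aggregate` stands in for the distributed pass. The point is that only the top node (level == 0) computes its own impurity and predict, children receive theirs from the parent's split, and levels 0 to `max_depth - 1` are the only ones aggregated.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    rows: list                        # (feature_value, label) pairs reaching this node
    predict: Optional[tuple] = None   # (label, probability)
    impurity: Optional[float] = None
    threshold: Optional[float] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def summarize(counts):
    """Return ((label, probability), gini impurity) for a label histogram."""
    total = sum(counts.values())
    label, count = max(counts.items(), key=lambda kv: (kv[1], -kv[0]))
    impurity = 1.0 - sum((c / total) ** 2 for c in counts.values())
    return (label, count / total), impurity

def aggregate(node):
    """Stand-in for the aggregation pass: label counts per feature value."""
    stats = {}
    for x, y in node.rows:
        stats.setdefault(x, Counter())[y] += 1
    return stats

def best_split(node, stats, level):
    total = sum(stats.values(), Counter())
    if level == 0:
        # Only the top node has no parent that already set these fields.
        node.predict, node.impurity = summarize(total)
    n, best, best_gain = sum(total.values()), None, 0.0
    for t in sorted(stats)[:-1]:
        left = sum((c for x, c in stats.items() if x <= t), Counter())
        right = total - left
        nl, nr = sum(left.values()), sum(right.values())
        weighted = nl / n * summarize(left)[1] + nr / n * summarize(right)[1]
        gain = node.impurity - weighted
        if gain > best_gain:
            best, best_gain = (t, left, right), gain
    return best

def train(rows, max_depth):
    root, passes = Node(rows), 0
    level_nodes = [root]
    # Level max_depth holds only leaves, so it is never aggregated.
    for level in range(max_depth):
        if not level_nodes:
            break
        passes += 1
        next_nodes = []
        for node in level_nodes:
            split = best_split(node, aggregate(node), level)
            if split is None:
                continue
            t, left_counts, right_counts = split
            node.threshold = t
            # Children are created with predict and impurity already set.
            node.left = Node([r for r in node.rows if r[0] <= t], *summarize(left_counts))
            node.right = Node([r for r in node.rows if r[0] > t], *summarize(right_counts))
            next_nodes += [node.left, node.right]
        level_nodes = next_nodes
    return root, passes

root, passes = train([(1, 0), (2, 0), (3, 1), (4, 1)], max_depth=1)
```

With `max_depth=1` the sketch performs a single pass, yet both leaves at level 1 already carry their predictions. Handling the top node inside `best_split` mirrors the trade-off described above: a separate initialization step would need its own copy of the aggregation logic.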
    
     

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/chouqin/spark avoid-agg

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/2708.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2708
    
----
commit 6cc0333bc02332bcf94d75c00b6850ea4d4e79f6
Author: Qiping Li <[email protected]>
Date:   2014-10-08T04:03:35Z

    SPARK-3158: Avoid 1 extra aggregation for DecisionTree training

commit e41d715bf35bc1dd948fdb2c60317fd66f86fdec
Author: Qiping Li <[email protected]>
Date:   2014-10-08T04:16:01Z

    fix bug in test suite

commit 822c91274526e77528ef0a1c4a0e92a14f5696a5
Author: Qiping Li <[email protected]>
Date:   2014-10-08T07:32:19Z

    add comments and unit test

commit 7ad7a71a0022ff808fb0066b68fe07a8c1a830b4
Author: Qiping Li <[email protected]>
Date:   2014-10-08T07:47:57Z

    fix unit test

----


