[ 
https://issues.apache.org/jira/browse/SPARK-3158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3158:
---------------------------------
    Priority: Major  (was: Minor)

> Avoid 1 extra aggregation for DecisionTree training
> ---------------------------------------------------
>
>                 Key: SPARK-3158
>                 URL: https://issues.apache.org/jira/browse/SPARK-3158
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Joseph K. Bradley
>
> Improvement: computation
> Currently, the implementation does one unnecessary aggregation step.  The 
> aggregation step for level L (to choose splits) gives enough information to 
> set the predictions of any leaf nodes at level L+1.  We can use that information to 
> skip the aggregation step for the last level of the tree (which only has leaf 
> nodes).
> This update could be done by:
> * allocating a root node before the loop in the main train() method
> * allocating nodes for level L+1 while choosing splits for level L
> * caching stats in these newly allocated nodes, so that we can calculate 
> predictions if we know they will be leaves
> * DecisionTree.findBestSplits can just return doneTraining
> This will let us cache impurity and avoid re-calculating it in 
> calculateGainForSplit.
> Some of the above notes were copied from the discussion in 
> [https://github.com/apache/spark/pull/2341]
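The idea behind the proposal can be sketched as follows: the per-child label histograms computed while choosing the best split for a node at level L are exactly the stats needed to set the predictions of its children at level L+1, so no further aggregation pass is required for a final level of leaves. This is a minimal, framework-free sketch; the function and variable names (`best_split_with_child_stats`, `gini`, the binary-split layout) are illustrative assumptions, not Spark MLlib's actual internal API:

```python
# Sketch: reuse the level-L aggregation byproducts to set predictions for
# level-(L+1) leaf nodes, instead of running one more aggregation pass.
# All names here are illustrative assumptions, not the Spark MLlib API.
from collections import Counter

def best_split_with_child_stats(points, feature_idx, thresholds):
    """For one node, pick the threshold minimizing weighted Gini impurity.

    Returns (score, threshold, left_stats, right_stats).  The left/right
    label histograms fall out of the same aggregation used to score the
    split, so they come "for free" with the level-L pass.
    """
    def gini(counts):
        n = sum(counts.values())
        return 1.0 - sum((c / n) ** 2 for c in counts.values()) if n else 0.0

    best = None
    for t in thresholds:
        left = Counter(lbl for x, lbl in points if x[feature_idx] <= t)
        right = Counter(lbl for x, lbl in points if x[feature_idx] > t)
        n = len(points)
        score = (sum(left.values()) / n) * gini(left) \
              + (sum(right.values()) / n) * gini(right)
        if best is None or score < best[0]:
            best = (score, t, left, right)
    return best

# The cached child histograms let us set leaf predictions directly,
# with no extra pass over the data:
points = [((0.1,), 0), ((0.4,), 0), ((0.6,), 1), ((0.9,), 1)]
score, thr, left_stats, right_stats = best_split_with_child_stats(
    points, feature_idx=0, thresholds=[0.5])
left_pred = left_stats.most_common(1)[0][0]    # majority label of left child
right_pred = right_stats.most_common(1)[0][0]  # majority label of right child
```

In the actual proposal the cached stats would live on the pre-allocated level-(L+1) nodes, which also lets the cached impurity be reused instead of recomputed in calculateGainForSplit.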



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
