[jira] [Comment Edited] (SPARK-3155) Support DecisionTree pruning

Qiping Li (JIRA) Wed, 27 Aug 2014 19:19:26 -0700

    [ https://issues.apache.org/jira/browse/SPARK-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14113224#comment-14113224 ]


Qiping Li edited comment on SPARK-3155 at 8/28/14 2:18 AM:
-----------------------------------------------------------

Joseph has submitted PR [#2125|https://github.com/apache/spark/pull/2125]. 
Should I implement min info gain on top of that PR? 


was (Author: chouqin):
Joseph has submitted PR [#2125|https://github.com/apache/spark/pull/2125], do I 
need to implement min info gain based on this? 

> Support DecisionTree pruning
> ----------------------------
>
>                 Key: SPARK-3155
>                 URL: https://issues.apache.org/jira/browse/SPARK-3155
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Joseph K. Bradley
>
> Improvement: accuracy, computation
> Summary: Pruning is a common method for preventing overfitting with decision 
> trees.  A smart implementation can prune the tree during training in order to 
> avoid training parts of the tree that would eventually be pruned anyway.  
> DecisionTree does not currently support pruning.
> Pruning:  A “pruning” of a tree is a subtree with the same root node, but 
> with zero or more branches removed.
> A naive implementation prunes as follows:
> (1) Train a depth K tree using a training set.
> (2) Compute the optimal prediction at each node (including internal nodes) 
> based on the training set.
> (3) Take a held-out validation set, and use the tree to make predictions for 
> each validation example.  This allows one to compute the validation error 
> made at each node in the tree (based on the predictions computed in step (2)).
> (4) For each pair of leaves with the same parent, compare the total error on 
> the validation set made by the leaves’ predictions with the error made by the 
> parent’s predictions.  Remove the leaves if the parent has lower error.
> A smarter implementation prunes during training, computing the error on the 
> validation set made by each node as it is trained.  Whenever two children 
> increase the validation error, they are pruned, and no more training is 
> required on that branch.
> It is common to use about 1/3 of the data for pruning.  Note that pruning is 
> important when using a tree directly for prediction.  It is less important 
> when combining trees via ensemble methods.
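
For illustration only, steps (2)-(4) of the naive approach quoted above can be
sketched in Python roughly as follows. This is not MLlib code: the Node class,
the errors() and prune() helpers, the binary threshold splits, and the
(features, label) example layout are all assumptions made up for this sketch.
It assumes step (1) has already produced a tree in which every node stores the
prediction computed from the training set.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

# Hypothetical example layout: (feature vector, class label).
Example = Tuple[Sequence[float], int]


@dataclass
class Node:
    """Hypothetical binary tree node; not an MLlib class."""
    prediction: int                    # step (2): prediction from the training set
    feature: Optional[int] = None      # split feature index, None for a leaf
    threshold: float = 0.0             # examples with value <= threshold go left
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def errors(prediction: int, examples: List[Example]) -> int:
    """Count the validation examples that this prediction gets wrong."""
    return sum(1 for _, label in examples if label != prediction)


def prune(node: Node, validation: List[Example]) -> int:
    """Prune bottom-up in place; return the subtree's validation error count."""
    if node.is_leaf():
        return errors(node.prediction, validation)

    # Step (3): route the validation examples that reach this node to its children.
    left_val = [ex for ex in validation if ex[0][node.feature] <= node.threshold]
    right_val = [ex for ex in validation if ex[0][node.feature] > node.threshold]
    children_error = prune(node.left, left_val) + prune(node.right, right_val)
    parent_error = errors(node.prediction, validation)

    # Step (4): if both children are (now) leaves and the parent's own
    # prediction has strictly lower validation error, remove the leaves.
    if node.left.is_leaf() and node.right.is_leaf() and parent_error < children_error:
        node.left = None
        node.right = None
        node.feature = None
        return parent_error
    return children_error
```

Because the recursion handles children before their parent, a node whose
children were just pruned is itself a candidate for pruning on the way back up.
The in-training variant described in the issue would apply the same comparison
as each split is learned instead of after the full tree is built.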



--
This message was sent by Atlassian JIRA
(v6.2#6252)

