[ https://issues.apache.org/jira/browse/SPARK-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14113224#comment-14113224 ]
Qiping Li edited comment on SPARK-3155 at 8/28/14 2:18 AM:
-----------------------------------------------------------

Joseph has submitted PR [#2125|https://github.com/apache/spark/pull/2125], do I need to implement min info gain based on that?

was (Author: chouqin):
Joseph has submitted PR [#2125|https://github.com/apache/spark/pull/2125], do I need to implement min info gain based on this?

> Support DecisionTree pruning
> ----------------------------
>
> Key: SPARK-3155
> URL: https://issues.apache.org/jira/browse/SPARK-3155
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Reporter: Joseph K. Bradley
>
> Improvement: accuracy, computation
> Summary: Pruning is a common method for preventing overfitting with decision
> trees. A smart implementation can prune the tree during training in order to
> avoid training parts of the tree which would eventually be pruned anyway.
> DecisionTree does not currently support pruning.
> Pruning: A “pruning” of a tree is a subtree with the same root node, but
> with zero or more branches removed.
> A naive implementation prunes as follows:
> (1) Train a depth K tree using a training set.
> (2) Compute the optimal prediction at each node (including internal nodes)
> based on the training set.
> (3) Take a held-out validation set, and use the tree to make predictions for
> each validation example. This allows one to compute the validation error
> made at each node in the tree (based on the predictions computed in step (2)).
> (4) For each pair of leaves with the same parent, compare the total error on
> the validation set made by the leaves’ predictions with the error made by the
> parent’s predictions. Remove the leaves if the parent has lower error.
> A smarter implementation prunes during training, computing the error on the
> validation set made by each node as it is trained. Whenever two children
> increase the validation error, they are pruned, and no more training is
> required on that branch.
> It is common to use about 1/3 of the data for pruning. Note that pruning is
> important when using a tree directly for prediction. It is less important
> when combining trees via ensemble methods.
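
The naive procedure in steps (1)-(4) of the description can be sketched in Python as follows. This is only an illustration, not MLlib's DecisionTree code: the `Node` class, the `errors` and `prune` helpers, the binary threshold splits, and the use of a 0/1 misclassification count as the validation error are all assumptions made for the example. It also assumes the tree has already been trained (step 1) and that every node, internal or leaf, stores its training-set prediction (step 2).

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

# A validation example: (feature vector, label). Hypothetical representation.
Example = Tuple[Sequence[float], int]


@dataclass
class Node:
    """Hypothetical tree node; not MLlib's Node class."""
    prediction: int                  # step (2): best prediction on the training set
    feature: int = -1                # index of the split feature (unused for leaves)
    threshold: float = 0.0           # examples with x[feature] <= threshold go left
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def errors(prediction: int, examples: List[Example]) -> int:
    """Step (3): number of validation examples this prediction gets wrong."""
    return sum(1 for _, label in examples if label != prediction)


def prune(node: Node, validation: List[Example]) -> int:
    """Prune the subtree in place, bottom-up (step 4).

    Returns the validation error count of the subtree after pruning.
    """
    if node.is_leaf():
        return errors(node.prediction, validation)

    # Route the validation examples to the children exactly as prediction would.
    left_ex = [e for e in validation if e[0][node.feature] <= node.threshold]
    right_ex = [e for e in validation if e[0][node.feature] > node.threshold]

    # Prune the children first, so pairs of leaves are considered bottom-up.
    children_err = prune(node.left, left_ex) + prune(node.right, right_ex)
    node_err = errors(node.prediction, validation)

    # Remove a pair of leaves only if the parent alone has strictly lower error.
    if node.left.is_leaf() and node.right.is_leaf() and node_err < children_err:
        node.left = None
        node.right = None
        return node_err
    return children_err
```

Calling `prune(root, validation_set)` on a trained tree collapses every pair of leaves whose parent does strictly better on the held-out data. The smarter variant in the description would make the same comparison during training, as soon as a node's two children exist, and stop growing that branch.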