[
https://issues.apache.org/jira/browse/SPARK-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14623719#comment-14623719
]
Walter Petersen commented on SPARK-3155:
----------------------------------------
Ok, fine. Thanks a lot.
> Support DecisionTree pruning
> ----------------------------
>
> Key: SPARK-3155
> URL: https://issues.apache.org/jira/browse/SPARK-3155
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Reporter: Joseph K. Bradley
>
> Improvement: accuracy, computation
> Summary: Pruning is a common method for preventing overfitting with decision
> trees. A smart implementation can prune the tree during training in order to
> avoid training parts of the tree which would be pruned eventually anyways.
> DecisionTree does not currently support pruning.
> Pruning: A “pruning” of a tree is a subtree with the same root node, but
> with zero or more branches removed.
> A naive implementation prunes as follows:
> (1) Train a depth K tree using a training set.
> (2) Compute the optimal prediction at each node (including internal nodes)
> based on the training set.
> (3) Take a held-out validation set, and use the tree to make predictions for
> each validation example. This allows one to compute the validation error
> made at each node in the tree (based on the predictions computed in step (2).)
> (4) For each pair of leafs with the same parent, compare the total error on
> the validation set made by the leafs’ predictions with the error made by the
> parent’s predictions. Remove the leafs if the parent has lower error.
> A smarter implementation prunes during training, computing the error on the
> validation set made by each node as it is trained. Whenever two children
> increase the validation error, they are pruned, and no more training is
> required on that branch.
> It is common to use about 1/3 of the data for pruning. Note that pruning is
> important when using a tree directly for prediction. It is less important
> when combining trees via ensemble methods.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]