[ https://issues.apache.org/jira/browse/SPARK-3155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14111396#comment-14111396 ]

Manish Amde commented on SPARK-3155:
------------------------------------

Hi Qiping, 

Thanks for creating the JIRA. Pruning is a desired feature for decision trees. 
Another related issue is https://issues.apache.org/jira/browse/SPARK-2207, which 
could possibly be tackled during this work. Let me know whether you are 
interested in this.

Even naive pruning after tree construction would be a good start.
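
To make that concrete, here is a minimal sketch of reduced-error pruning on a 
held-out validation set, essentially steps (1)-(4) from the description below. 
The Example and Node classes and the split representation are simplified 
stand-ins for illustration, not the actual MLlib internals:

    case class Example(features: Array[Double], label: Double)

    // Simplified tree node: goLeft encodes the learned split test at an
    // internal node (None for a leaf); prediction is the optimal prediction
    // at this node, computed from the training set (step 2).
    class Node(
        val prediction: Double,
        val goLeft: Option[Array[Double] => Boolean],
        var left: Option[Node],
        var right: Option[Node]) {
      def isLeaf: Boolean = left.isEmpty && right.isEmpty
    }

    // Steps 3-4: prune bottom-up, returning the validation error of the
    // pruned subtree. `validation` holds only the examples reaching this node.
    def prune(node: Node, validation: Seq[Example]): Double = {
      // 0/1 error if this node predicted directly for its validation examples.
      val errorAsLeaf = validation.count(_.label != node.prediction).toDouble
      if (node.isLeaf) return errorAsLeaf
      val (lData, rData) = validation.partition(e => node.goLeft.get(e.features))
      val subtreeError = prune(node.left.get, lData) + prune(node.right.get, rData)
      if (errorAsLeaf <= subtreeError) {
        // The parent alone does no worse than its children: drop the branch
        // (ties prefer the smaller tree).
        node.left = None
        node.right = None
        errorAsLeaf
      } else {
        subtreeError
      }
    }

Holding out roughly 1/3 of the data for pruning could be done with 
RDD.randomSplit, e.g. data.randomSplit(Array(2.0, 1.0)); the weights are 
normalized, so this gives roughly a 2/3 / 1/3 split.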

The smarter implementation would have to be better than the naive 
implementation on two fronts:
1. Training time on most datasets
2. Accuracy

I would recommend the following:
a. Start with a naive implementation first and add it to the main codebase
b. Experiment with your suggested smart implementation (a rough sketch is 
below). If we see significant improvements on both fronts, we could add a 
switch to enable smart pruning (possibly by default) during tree construction.
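
For (b), here is a rough sketch of the during-training check, reusing Example 
and Node from the sketch above. Everything here is simplified for illustration: 
majorityLabel and bestSplit are toy stand-ins for MLlib's prediction and split 
search, and the real implementation trains level-by-level over RDDs rather 
than recursively on local collections:

    // Toy stand-in for the optimal prediction at a node.
    def majorityLabel(data: Seq[Example]): Double =
      data.groupBy(_.label).maxBy(_._2.size)._1

    // Toy stand-in for split search: threshold feature 0 at its mean.
    def bestSplit(data: Seq[Example]): Option[Array[Double] => Boolean] =
      if (data.map(_.label).distinct.size <= 1) None
      else {
        val t = data.map(_.features(0)).sum / data.size
        Some((f: Array[Double]) => f(0) < t)
      }

    // Checks each candidate split on held-out data *before* growing the
    // branch; if the children would not beat the parent's own prediction on
    // the validation examples, the branch is pruned and never trained.
    def train(training: Seq[Example], validation: Seq[Example], depth: Int): Node = {
      val prediction = majorityLabel(training)
      val asLeaf = new Node(prediction, None, None, None)
      if (depth == 0) return asLeaf
      bestSplit(training) match {
        case None => asLeaf
        case Some(goLeft) =>
          val (tl, tr) = training.partition(e => goLeft(e.features))
          if (tl.isEmpty || tr.isEmpty) return asLeaf
          val (vl, vr) = validation.partition(e => goLeft(e.features))
          val childError = vl.count(_.label != majorityLabel(tl)) +
                           vr.count(_.label != majorityLabel(tr))
          val parentError = validation.count(_.label != prediction)
          if (childError >= parentError) asLeaf  // the split does not help: stop here
          else new Node(prediction, Some(goLeft),
            Some(train(tl, vl, depth - 1)), Some(train(tr, vr, depth - 1)))
      }
    }

Whether this beats the naive version on training time and accuracy is exactly 
what the experiments in (b) should tell us.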

I am still not sure when to begin the RF work. Could you give me a day or 
two to figure this out? If I decide to work on it in the second half of 
September, maybe you can get started on this work ASAP.

-Manish

> Support DecisionTree pruning
> ----------------------------
>
>                 Key: SPARK-3155
>                 URL: https://issues.apache.org/jira/browse/SPARK-3155
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: Joseph K. Bradley
>
> Improvement: accuracy, computation
> Summary: Pruning is a common method for preventing overfitting with decision 
> trees.  A smart implementation can prune the tree during training in order to 
> avoid training parts of the tree which would eventually be pruned anyway.  
> DecisionTree does not currently support pruning.
> Pruning:  A “pruning” of a tree is a subtree with the same root node, but 
> with zero or more branches removed.
> A naive implementation prunes as follows:
> (1) Train a depth K tree using a training set.
> (2) Compute the optimal prediction at each node (including internal nodes) 
> based on the training set.
> (3) Take a held-out validation set, and use the tree to make predictions for 
> each validation example.  This allows one to compute the validation error 
> made at each node in the tree (based on the predictions computed in step (2)).
> (4) For each pair of leaves with the same parent, compare the total error on 
> the validation set made by the leaves’ predictions with the error made by the 
> parent’s predictions.  Remove the leaves if the parent has lower error.
> A smarter implementation prunes during training, computing the error on the 
> validation set made by each node as it is trained.  Whenever two children 
> increase the validation error, they are pruned, and no more training is 
> required on that branch.
> It is common to use about 1/3 of the data for pruning.  Note that pruning is 
> important when using a tree directly for prediction.  It is less important 
> when combining trees via ensemble methods.


