Duc Trung NGUYEN created SPARK-9190:
---------------------------------------

             Summary: CLONE - Support DecisionTree pruning
                 Key: SPARK-9190
                 URL: https://issues.apache.org/jira/browse/SPARK-9190
             Project: Spark
          Issue Type: Improvement
          Components: MLlib
            Reporter: Duc Trung NGUYEN


Improvement: accuracy, computation

Summary: Pruning is a common method for preventing overfitting with decision 
trees.  A smart implementation can prune the tree during training in order to 
avoid training parts of the tree which would be pruned eventually anyway.  
DecisionTree does not currently support pruning.

Pruning:  A “pruning” of a tree is a subtree with the same root node, but with 
zero or more branches removed.
A naive implementation prunes as follows:
(1) Train a depth K tree using a training set.
(2) Compute the optimal prediction at each node (including internal nodes) 
based on the training set.
(3) Take a held-out validation set, and use the tree to make predictions for 
each validation example.  This allows one to compute the validation error made 
at each node in the tree (based on the predictions computed in step (2)).
(4) For each pair of leaves with the same parent, compare the total error on 
the validation set made by the leaves’ predictions with the error made by the 
parent’s predictions.  Remove the leaves if the parent has lower error.

A smarter implementation prunes during training, computing the error on the 
validation set made by each node as it is trained.  Whenever two children 
increase the validation error, they are pruned, and no more training is 
required on that branch.
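The during-training variant can be sketched as follows.  Again this is a 
hypothetical illustration (a greedy 1-D threshold tree with 0/1 loss), not 
MLlib code: each candidate split is chosen on the training set as usual, but 
before recursing, the would-be children are checked against the validation 
set, and the branch stops growing if they increase the validation error:

```python
def majority(labels):
    # Optimal 0/1-loss prediction at a node: the most common label.
    return max(set(labels), key=labels.count) if labels else 0

def errors(pred, data):
    return sum(1 for _, y in data if y != pred)

def grow(train, val, max_depth):
    """Train while pruning: stop splitting a branch as soon as the two
    children would make more validation errors than their parent."""
    pred = majority([y for _, y in train])
    if max_depth == 0:
        return {"prediction": pred}
    # Pick the split that best fits the *training* set (greedy, as usual).
    best = None
    for t in sorted({x for x, _ in train}):
        tl = [(x, y) for x, y in train if x <= t]
        tr = [(x, y) for x, y in train if x > t]
        if not tl or not tr:
            continue
        fit = (errors(majority([y for _, y in tl]), tl)
               + errors(majority([y for _, y in tr]), tr))
        if best is None or fit < best[0]:
            best = (fit, t, tl, tr)
    if best is None:
        return {"prediction": pred}
    _, t, tl, tr = best
    # Pruning-during-training check on the *validation* set: if the two
    # children would increase validation error, stop growing this branch.
    vl = [(x, y) for x, y in val if x <= t]
    vr = [(x, y) for x, y in val if x > t]
    child_err = (errors(majority([y for _, y in tl]), vl)
                 + errors(majority([y for _, y in tr]), vr))
    if child_err > errors(pred, val):
        return {"prediction": pred}   # pruned: no more training here
    return {"threshold": t,
            "left": grow(tl, vl, max_depth - 1),
            "right": grow(tr, vr, max_depth - 1)}

def tree_predict(node, x):
    while "prediction" not in node:
        node = node["left"] if x <= node["threshold"] else node["right"]
    return node["prediction"]
```

The saving over the naive approach is that a pruned branch is never trained in 
the first place, rather than being trained to full depth and then removed.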

It is common to use about 1/3 of the data for pruning.  Note that pruning is 
important when using a tree directly for prediction.  It is less important when 
combining trees via ensemble methods.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
