2011/12/2 James Bergstra <[email protected]>:
> I'm looking at the decision tree code and I'm not seeing any pruning
> logic, or other logic to prevent over-fitting (other than requiring
> that leaf nodes be sufficiently populated). Decision trees are not my
> specialty, but pruning / early stopping seem often to be mentioned in
> connection with trees. Should I add at least early stopping?

Hi James,

currently, there are two ways to prevent over-fitting: a) limit the depth
of the tree via `max_depth` or b) don't expand nodes that receive
`min_split` or fewer samples.

For me, tree pruning (e.g. via cost-complexity pruning) has low priority,
since nowadays decision trees are rarely used in isolation but rather as
weak learners in an ensemble (e.g. Random Forests or Boosting). These
ensemble techniques usually don't rely on pruning: they either use fully
grown trees and tackle over-fitting via averaging (e.g. Random Forests)
or use very shallow trees (e.g. trees of depth 2-4 are very common for
boosting).

> On first glance, it looks like an alternative "_build_tree" would be
> the way to go, that pulls out some validation examples to test the
> split returned by find_split().
> - If validation error decreases, the split is good and validation
> examples go back into the training set for the recursive calls.
> - Otherwise we make this a leaf node.

I'm not an expert in tree pruning, but often tree pruning is just a
post-processing step: you first build the full tree (via `_build_tree`)
and then prune it back.

But I agree, pruning would be a very valuable contribution - I'd be glad
to see any pull requests in the future!

best,
Peter

-- 
Peter Prettenhofer
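
To make the two ideas in this thread concrete, here is a minimal pure-Python
sketch, not the scikit-learn code. It assumes a toy tree stored as nested
dicts; `build_tree`, `find_split`, `predict` and `prune` are hypothetical
helpers written for illustration only. Stopping uses the `max_depth` and
`min_split` rules as described above, and the post-processing step is
reduced-error pruning on held-out data, a simpler stand-in for
cost-complexity pruning.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def find_split(X, y):
    """Return the (feature, threshold) that lowers weighted Gini most, or None."""
    best, best_score = None, gini(y)
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def build_tree(X, y, max_depth=None, min_split=1, depth=0):
    """Grow a tree; stop at max_depth or when a node has <= min_split samples."""
    node = {"label": Counter(y).most_common(1)[0][0]}
    if (max_depth is not None and depth >= max_depth) or len(y) <= min_split:
        return node  # early stopping: this node stays a leaf
    split = find_split(X, y)
    if split is None:
        return node  # no split reduces impurity (e.g. the node is pure)
    f, t = split
    left = [i for i, row in enumerate(X) if row[f] <= t]
    right = [i for i, row in enumerate(X) if row[f] > t]
    node["split"] = split
    node["left"] = build_tree([X[i] for i in left], [y[i] for i in left],
                              max_depth, min_split, depth + 1)
    node["right"] = build_tree([X[i] for i in right], [y[i] for i in right],
                               max_depth, min_split, depth + 1)
    return node

def predict(node, row):
    """Follow splits down to a leaf and return its majority label."""
    while "split" in node:
        f, t = node["split"]
        node = node["left"] if row[f] <= t else node["right"]
    return node["label"]

def prune(node, X_val, y_val):
    """Reduced-error pruning: bottom-up, collapse a subtree into a leaf when
    the leaf makes no more errors on the validation samples reaching it."""
    if "split" not in node:
        return node
    f, t = node["split"]
    left = [i for i, row in enumerate(X_val) if row[f] <= t]
    right = [i for i, row in enumerate(X_val) if row[f] > t]
    node["left"] = prune(node["left"], [X_val[i] for i in left],
                         [y_val[i] for i in left])
    node["right"] = prune(node["right"], [X_val[i] for i in right],
                          [y_val[i] for i in right])
    subtree_errors = sum(predict(node, r) != lab for r, lab in zip(X_val, y_val))
    leaf_errors = sum(node["label"] != lab for lab in y_val)
    return {"label": node["label"]} if leaf_errors <= subtree_errors else node
```

Usage would look like `tree = prune(build_tree(X_train, y_train), X_val, y_val)`,
i.e. grow first and prune afterwards. Note that `prune` modifies the tree it is
given, and that a node reached by no validation samples is always collapsed
under this rule.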
