I'm looking at the decision tree code and I don't see any pruning logic, or any other mechanism to prevent overfitting (other than requiring that leaf nodes be sufficiently populated). Decision trees are not my specialty, but pruning and early stopping are often mentioned in connection with trees. Should I add at least early stopping?
At first glance, it looks like the way to go is an alternative "_build_tree" that pulls out some validation examples to test the split returned by find_split():

- If validation error decreases, the split is good, and the validation examples go back into the training set for the recursive calls.
- Otherwise, we make this node a leaf.

I'd add one extra kwarg to BaseDecisionTree, something like

    pruning = dict(algo=None)

and the new code would be triggered by

    pruning = dict(algo='early_stopping', min_valid_count=10, min_valid_frac=.2)

There would be an if statement (no registry or anything) in BaseDecisionTree.fit that just looks at the pruning argument and calls either _build_tree or the new _build_tree_early_stopping. If someone wants to add other pruning algorithms in the future, they'd be welcome to follow the same pattern; I think there are not many such algorithms, so piling them up in this file would be OK.

Thoughts?

- James
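
P.S. To make this concrete, here is a rough, untested sketch of the regression case. Only find_split, the pruning kwarg names, and _build_tree_early_stopping come from the proposal above; Node, _leaf, and the squared-error scoring are stand-ins I invented for illustration, not the real tree internals.

    import numpy as np

    class Node(object):
        # Placeholder for the real tree node structure.
        def __init__(self, feature=None, threshold=None,
                     left=None, right=None, value=None):
            self.feature = feature      # index of the split feature
            self.threshold = threshold  # split threshold
            self.left = left            # subtree for x[feature] <= threshold
            self.right = right          # subtree for x[feature] > threshold
            self.value = value          # constant prediction at a leaf

    def _leaf(y):
        # Regression leaf: predict the mean target.
        return Node(value=np.mean(y))

    def _build_tree_early_stopping(X, y, find_split,
                                   min_valid_count=10, min_valid_frac=0.2,
                                   rng=None):
        if rng is None:
            rng = np.random.RandomState(0)
        n_valid = max(min_valid_count, int(min_valid_frac * len(y)))
        if len(y) <= n_valid:
            return _leaf(y)

        # Hold out validation examples to score the candidate split.
        idx = rng.permutation(len(y))
        valid, train = idx[:n_valid], idx[n_valid:]

        split = find_split(X[train], y[train])  # -> (feature, threshold) or None
        if split is None:
            return _leaf(y)
        feature, threshold = split

        left_tr = X[train, feature] <= threshold
        if left_tr.all() or not left_tr.any():
            return _leaf(y)  # degenerate split, nothing to gain

        # Validation error of keeping this node a leaf...
        err_leaf = np.mean((y[valid] - np.mean(y[train])) ** 2)
        # ...versus taking the split (each side predicts its training mean).
        go_left = X[valid, feature] <= threshold
        preds = np.where(go_left, np.mean(y[train][left_tr]),
                         np.mean(y[train][~left_tr]))
        err_split = np.mean((y[valid] - preds) ** 2)

        if err_split >= err_leaf:
            # Split doesn't help on held-out data: stop here.
            return _leaf(y)

        # Split helps: validation examples go back into the training
        # set for the recursive calls.
        mask = X[:, feature] <= threshold
        return Node(feature, threshold,
                    left=_build_tree_early_stopping(
                        X[mask], y[mask], find_split,
                        min_valid_count, min_valid_frac, rng),
                    right=_build_tree_early_stopping(
                        X[~mask], y[~mask], find_split,
                        min_valid_count, min_valid_frac, rng))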
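
The dispatch in fit would then be a plain if, along these lines (self.pruning and self.tree_ are again made-up attribute names):

    # Inside BaseDecisionTree.fit (illustrative only):
    pruning = self.pruning or dict(algo=None)
    if pruning.get('algo') == 'early_stopping':
        self.tree_ = _build_tree_early_stopping(
            X, y, find_split,
            min_valid_count=pruning.get('min_valid_count', 10),
            min_valid_frac=pruning.get('min_valid_frac', 0.2))
    else:
        self.tree_ = _build_tree(X, y, find_split)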
