I'm looking at the decision tree code and I'm not seeing any pruning
logic, or other logic to prevent overfitting (other than requiring
that leaf nodes be sufficiently populated).  Decision trees are not my
specialty, but pruning and early stopping are often mentioned in
connection with trees. Should I add at least early stopping?

At first glance, it looks like the way to go would be an alternative
"_build_tree" that holds out some validation examples to test the
split returned by find_split() (see the sketch after this list):
- If validation error decreases, the split is good, and the validation
examples go back into the training set for the recursive calls.
- Otherwise, we make this a leaf node.
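
Here's a rough sketch of the control flow I have in mind, written as a
standalone function with dict nodes rather than the real tree internals
(find_split is assumed to return a (feature, threshold) pair or None, and
leaves just predict the mean, i.e. the regression case; all names here are
hypothetical, not the actual sklearn code):

import numpy as np

def build_tree_early_stopping(X, y, find_split, random_state,
                              min_valid_count=10, min_valid_frac=0.2):
    # Hold out a validation subset at this node.
    n_samples = len(y)
    n_valid = max(min_valid_count, int(min_valid_frac * n_samples))
    if n_valid >= n_samples:
        return dict(leaf=True, value=y.mean())  # too small to validate

    perm = random_state.permutation(n_samples)
    valid, train = perm[:n_valid], perm[n_valid:]

    split = find_split(X[train], y[train])
    if split is None:
        return dict(leaf=True, value=y.mean())
    feature, threshold = split

    go_left = X[:, feature] <= threshold
    n_left = go_left[train].sum()
    if n_left == 0 or n_left == len(train):
        return dict(leaf=True, value=y.mean())  # degenerate split

    # Validation error if we stop here (predict the training mean) ...
    err_leaf = np.mean((y[valid] - y[train].mean()) ** 2)
    # ... versus if we apply the split and predict per-side means.
    left_mean = y[train][go_left[train]].mean()
    right_mean = y[train][~go_left[train]].mean()
    pred = np.where(go_left[valid], left_mean, right_mean)
    err_split = np.mean((y[valid] - pred) ** 2)

    if err_split >= err_leaf:
        return dict(leaf=True, value=y.mean())  # split didn't help: stop

    # Split helped: validation examples rejoin the pool for the children.
    return dict(leaf=False, feature=feature, threshold=threshold,
                left=build_tree_early_stopping(
                    X[go_left], y[go_left], find_split, random_state,
                    min_valid_count, min_valid_frac),
                right=build_tree_early_stopping(
                    X[~go_left], y[~go_left], find_split, random_state,
                    min_valid_count, min_valid_frac))

The real code would of course have to go through the existing criterion
machinery and respect sample_weight etc., but that's the shape of it.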

I'd add one extra kwarg to BaseDecisionTree... something like

pruning = dict(algo=None)

The new code would be triggered by

pruning = dict(algo='early_stopping', min_valid_count=10, min_valid_frac=0.2)

There would be an if statement (no registry or anything) in
BaseDecisionTree.fit that just looks at the pruning argument and calls
either _build_tree or the new _build_tree_early_stopping, roughly as
sketched below.
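
Roughly like this (sketch only, eliding everything fit already does;
_build_tree_early_stopping is the hypothetical new method from above):

def fit(self, X, y):
    # ... existing input checking and setup ...
    pruning = self.pruning or dict(algo=None)
    algo = pruning.get('algo')
    if algo is None:
        self._build_tree(X, y)
    elif algo == 'early_stopping':
        self._build_tree_early_stopping(
            X, y,
            min_valid_count=pruning.get('min_valid_count', 10),
            min_valid_frac=pruning.get('min_valid_frac', 0.2))
    else:
        raise ValueError("unrecognized pruning algo: %r" % algo)
    return self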

If someone wants to add other pruning algorithms in the future, they'd
be welcome to follow the pattern, but I think there are not many such
algos, so piling them up in this file would be ok.

Thoughts?

- James
