Would anyone be interested in an implementation of an ensemble decision-tree
classifier that can be trained incrementally? Or does anyone know why one is
not already included in sklearn?

I have a dataset that can be segmented very cleanly by a decision tree, but
it's too large to fit into memory, so I couldn't directly apply the
DecisionTreeClassifier or RandomForestClassifier, since they don't
implement partial_fit().
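
A quick way to confirm which estimators support incremental training is to
check for the partial_fit attribute. The snippet below is only a sanity check;
SGDClassifier is not part of my setup and is included purely for comparison,
as an example of an estimator that does implement partial_fit().

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier

# SGDClassifier is here only as a point of comparison (an assumption of this
# example, not something used in the approach described in this post).
estimators = [DecisionTreeClassifier, RandomForestClassifier, SGDClassifier]

# Map each estimator name to whether it exposes partial_fit().
supports = {est.__name__: hasattr(est, "partial_fit") for est in estimators}

for name, ok in supports.items():
    print(f"{name}: partial_fit available = {ok}")
```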

So I wrote a simple implementation that maintains a forest of
DecisionTreeClassifier instances. Every call to partial_fit() trains a new
instance, and only the N best instances are kept, ranked by the value
returned by score(). The methods predict() and predict_proba() work
similarly to those in the other ensemble tree classifiers.
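
To make the idea concrete, here is a minimal sketch of the scheme. It is an
illustration, not the actual implementation: the class name IncrementalForest,
the holdout parameter, and the choice to score every tree on a held-out slice
of the current batch are all assumptions made for this example. It also
assumes the full list of class labels is passed on the first call, as other
partial_fit() estimators in sklearn require, and averages per-tree
probabilities for predict_proba().

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


class IncrementalForest:
    """Hypothetical sketch: keep the n_estimators best per-batch trees."""

    def __init__(self, n_estimators=10, holdout=0.25, random_state=0):
        self.n_estimators = n_estimators
        self.holdout = holdout          # assumed: fraction of each batch used for scoring
        self.random_state = random_state
        self.trees_ = []
        self.classes_ = None

    def partial_fit(self, X, y, classes=None):
        if self.classes_ is None:
            if classes is None:
                raise ValueError("classes must be given on the first call")
            self.classes_ = np.sort(np.asarray(classes))

        # Assumption: score on a held-out slice of the current batch so the
        # new tree is not judged on the data it was trained on.
        X_fit, X_val, y_fit, y_val = train_test_split(
            X, y, test_size=self.holdout, random_state=self.random_state
        )

        # Train one new tree on this batch.
        new_tree = DecisionTreeClassifier(random_state=self.random_state)
        new_tree.fit(X_fit, y_fit)

        # Rank old and new trees by score() and keep the N best.
        candidates = self.trees_ + [new_tree]
        scores = [tree.score(X_val, y_val) for tree in candidates]
        keep = np.argsort(scores)[::-1][: self.n_estimators]
        self.trees_ = [candidates[i] for i in keep]
        return self

    def predict_proba(self, X):
        X = np.asarray(X)
        proba = np.zeros((X.shape[0], len(self.classes_)))
        for tree in self.trees_:
            # A tree may have seen only a subset of the classes, so map its
            # columns onto the global class order before accumulating.
            cols = np.searchsorted(self.classes_, tree.classes_)
            proba[:, cols] += tree.predict_proba(X)
        return proba / len(self.trees_)

    def predict(self, X):
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]
```

Usage would look like calling forest.partial_fit(X_chunk, y_chunk,
classes=all_labels) once per chunk read from disk, then forest.predict(X_new).
Note that each tree only ever sees one chunk, so a chunk needs to be large
enough to be representative on its own.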

In my initial tests, performance and accuracy look good, though I expect it
will rarely, if ever, be more accurate than a single DecisionTreeClassifier
trained on all of the data when that data fits into memory. Does anyone know
of any other potential caveats to this approach?
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general