>
> A related question:
> I know you spent a lot of time tweaking the decision tree code.
> Do you think there might still be some potential there?
There's certainly room for improvement! I don't know if I speak for
Brian too, but this was the first time I wrote a decision tree and I did
not look extensively at other implementations (well... I looked at R's
gbm package afterwards, only to find that it looks remarkably similar -
but it treats categorical and real-valued features differently).

You can gain a lot if you can make some assumptions about the input
data. The current code is designed for data with a large number of
potential split points (real-valued features) - if you expect few
potential split points (e.g. mostly categorical features), then you
would use fundamentally different data structures (don't iterate over
the sample mask but use some form of skip list). I remember a paper
about scaling regression trees (and ultimately gradient boosting for
learning-to-rank applications) in which the authors proposed an adaptive
binning procedure to make the (dominant) real-valued features discrete,
which allows a more efficient search for split points - there's a rough
sketch of the idea at the bottom of this mail.

>
> In particular, would it be possible to use lists of pointers/indices
> instead of the masks to avoid iterating over masked out items? Or do
> you think that would create too much overhead?

The convenient point about the sample mask is that you need not update
your primary data structure (i.e. X_argsorted); only the mask changes
from node to node (the second sketch below illustrates this). I think
any significant performance leap in the decision tree would ultimately
require changing the data structure, which is not low-hanging fruit.

>
> The current code works great for me (thanks for contributing!!!!),
> still it would mean a lot if I could make it even faster. At the
> moment it takes me about 8 hours to grow a tree with only a subset
> of the features that I actually want to use.... I have a 128 core
> cluster here but then building a forest with 1000 trees would still
> take roughly 6 days....

8 hours is quite a lot... there must be some room for improvement -
maybe we should schedule a profiling session :-) But I'd also like to
point out that I benched our code (not the latest version though)
against WEKA and R's rpart -> the first didn't even allow me to load
the benchmark dataset (covertype), while the latter was at least 10
times slower.

--
Peter Prettenhofer
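
P.S.: two rough sketches in plain numpy of what I mean above. They are
only illustrations - the function names and signatures are made up, and
the real (Cython) tree code does not look like this.

The first one is the quantile-binning idea: discretize each real-valued
feature into a small number of bins so that only the bin edges remain as
candidate split points (this is just the general idea, not the exact
procedure from the paper):

    import numpy as np

    def quantile_bin(X, n_bins=32):
        """Discretize each column of X into at most n_bins quantile bins.

        Returns the integer bin codes and the bin edges per feature; the
        edges are the only candidate split points left to search.
        """
        X = np.asarray(X, dtype=np.float64)
        n_samples, n_features = X.shape
        X_binned = np.empty((n_samples, n_features), dtype=np.int32)
        edges = []
        for j in range(n_features):
            # empirical quantiles -> bins adapt to the feature's distribution
            qs = np.percentile(X[:, j], np.linspace(0, 100, n_bins + 1)[1:-1])
            qs = np.unique(qs)  # drop duplicate edges
            X_binned[:, j] = np.searchsorted(qs, X[:, j])
            edges.append(qs)
        return X_binned, edges

The second one illustrates the sample mask / X_argsorted interplay: the
argsort is computed once per feature up front and never touched again;
from node to node only a boolean mask differs, and the split search
walks the samples in presorted order while ignoring the masked-out ones.
Here the filtering is done with fancy indexing up front, whereas the
real code iterates over the sample mask instead of building a compact
index array; assume X_argsorted[:, j] == np.argsort(X[:, j]):

    import numpy as np

    def best_threshold_sse(X, y, X_argsorted, sample_mask, feature):
        """Toy split search for one feature over the samples in the mask.

        Minimizes the sum of squared errors of a two-leaf regression
        split; X_argsorted is never modified, only sample_mask differs
        per node.
        """
        order = X_argsorted[:, feature]
        idx = order[sample_mask[order]]  # in-node samples, sorted by feature
        x_sorted = X[idx, feature]
        y_sorted = y[idx]

        n = len(idx)
        left_sum, right_sum = 0.0, y_sorted.sum()
        left_sq, right_sq = 0.0, (y_sorted ** 2).sum()
        best_sse, best_thr = np.inf, None
        for i in range(n - 1):
            left_sum += y_sorted[i]
            right_sum -= y_sorted[i]
            left_sq += y_sorted[i] ** 2
            right_sq -= y_sorted[i] ** 2
            if x_sorted[i] == x_sorted[i + 1]:
                continue  # equal feature values -> no valid split here
            n_left, n_right = i + 1, n - i - 1
            sse = (left_sq - left_sum ** 2 / n_left) + \
                  (right_sq - right_sum ** 2 / n_right)
            if sse < best_sse:
                best_sse = sse
                best_thr = 0.5 * (x_sorted[i] + x_sorted[i + 1])
        return best_thr, best_sse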
