>
> A related question:
> I know you spent a lot of time tweaking the decision tree code.
> Do you think there might still be some potential there?

There's certainly room for improvement! I don't know whether I speak
for Brian too, but this was the first time I wrote a decision tree and
I did not look extensively at other implementations (well... I looked
at R's gbm package afterwards and found that it looks remarkably
similar - but it treats categorical and real-valued features
differently).

You can gain a lot if you can make some assumptions about the input
data. The current code is designed for data with a large number of
potential split points (real-valued features) - if you expect few
potential split points (e.g. mostly categorical features), then you
would use fundamentally different data structures (don't iterate over
the sample mask but use some form of skip list).
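To sketch the difference in plain Python/NumPy (toy data, names made
up): a boolean mask forces a pass over *all* samples at every node,
while an explicit index list only touches the samples that are still
active - a big win when few samples survive deep in the tree.

```python
import numpy as np

n_samples = 10
sample_mask = np.zeros(n_samples, dtype=bool)
sample_mask[[2, 5, 7]] = True          # only 3 of 10 samples active

# Mask style: visit every position, skip the masked-out ones.
active_via_mask = [i for i in range(n_samples) if sample_mask[i]]

# Index-list style: store and iterate only the active indices.
index_list = [2, 5, 7]

# Same samples either way; the work is O(n_samples) vs O(n_active).
assert active_via_mask == index_list
```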

I remember a paper about scaling regression trees (and ultimately
gradient boosting for learning-to-rank applications) where the authors
proposed an adaptive binning procedure to make the (dominant)
real-valued features discrete, which allows a more efficient search
for split points.
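I don't have the reference at hand, but the basic idea can be sketched
roughly like this (quantile binning; the function name and `n_bins`
are my own, not from the paper): each feature is mapped to a small
number of integer codes, so split search scans n_bins - 1 candidate
thresholds instead of n_samples - 1.

```python
import numpy as np

def quantile_bins(x, n_bins=8):
    """Map a real-valued feature to n_bins discrete codes using
    quantile edges (a rough sketch of adaptive binning)."""
    # n_bins - 1 interior quantiles serve as bin boundaries.
    edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    # Each sample gets the index of the bin it falls into: 0..n_bins-1.
    codes = np.searchsorted(edges, x, side="right")
    return codes, edges

codes, edges = quantile_bins(np.arange(100.0), n_bins=8)
# 8 bins -> codes in {0, ..., 7}, 7 boundary values.
```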

>
> In particular, would it be possible to use lists of pointers/indices
> instead of the
> masks to avoid iterating over masked out items? Or do you think
> that would create too much overhead?

The convenient point about the sample mask is that you need not update
your primary data structure (i.e. X_argsorted).
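A toy illustration of why that is convenient (variable names chosen to
match, data made up): X_argsorted is built once for the whole tree,
and when you recurse into a child node only the mask changes - the
presorted order stays valid and is just filtered on the fly.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((8, 3))
X_argsorted = np.argsort(X, axis=0)   # built once, never updated

# Recursing into a child node: only the sample mask shrinks.
sample_mask = np.array([1, 0, 1, 1, 0, 1, 0, 1], dtype=bool)
feature = 1
sorted_active = [i for i in X_argsorted[:, feature] if sample_mask[i]]

# The filtered sequence is still sorted by the feature's values,
# so split search can proceed without re-sorting.
vals = X[sorted_active, feature]
assert np.all(np.diff(vals) >= 0)
```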

I think any significant performance leap in the decision tree would
ultimately require changing the data structure, which is not
low-hanging fruit.

>
> The current code works great for me (thanks for contributing!!!!),
> still it would mean a lot if I could make it even faster. At the moment
> it takes me
> about 8 hours to grow a tree with only a subset of the features
> that I actually want to use.... I have a 128 core cluster here but then
> building
> a forest with 1000 trees would still take roughly 6 days....

8 hours is quite a lot... there must be some room for improvement -
maybe we should schedule a profiling session :-)

But I'd also like to point out that I benchmarked our code (not the
latest version though) against WEKA and R's rpart: the former didn't
even let me load the benchmark dataset (covertype), while the latter
was at least 10 times slower.

-- 
Peter Prettenhofer

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
