2012/3/25 Scott White <[email protected]>:
> I have noticed also that GBRT can be quite slow even for small k.
> Since GBRT fits n*k decision trees, by design, it's crucial that the
> decision tree code be highly optimized. I did some quick performance
> profiling the other day, which showed that the bottleneck was the
> method ``_find_best_split``.
>
> Looking into that method I found what I think is a major inefficiency
> that needs to be addressed. Nested within a for loop is a 'while True'
> that iterates over the splitting points for a given feature to test
> for the best split. One problem I see is that the splitting points are
> not indexed ahead of time so a lot of time is wasted finding the same
> splitting points over and over again. By allowing the splitting points
> to be indexed ahead of time, I believe the performance bottleneck of
> finding the splits would be reduced from O(m) to O(u).
> This should have a marked improvement on datasets that have a lot of
> repeating values. Of course, this would come at the additional cost of
> storing those indexed splitting points.

I totally agree - the current implementation is tailored towards data
sets with numerical features (and lots of potential split points) -
the ``sample_mask`` approach indeed does poorly on data sets with
categorical features. On the other hand, the indexing approach has
significant pre-processing costs and the cost of updating the data
structure is much higher too - so I assume that there is a break-even
somewhere and at the end of the day it really depends on the data you
are working with.
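To make the trade-off concrete, here is a rough sketch of the indexing idea: argsort each feature column once up front, then enumerate candidate thresholds in a single linear scan over the sorted order, skipping runs of identical values (which is where the win on repeated/categorical values comes from). All names here are illustrative, not the actual scikit-learn internals:

```python
import numpy as np

def candidate_splits(X_col, y):
    """Yield (threshold, left_mean, right_mean) for one feature column.

    Sketch only: the argsort is the one-off pre-processing cost discussed
    above; the scan itself is O(n) and visits each distinct value once.
    """
    order = np.argsort(X_col)          # done once per feature
    xs, ys = X_col[order], y[order]
    n = len(xs)
    total = ys.sum()
    left_sum = 0.0
    for i in range(n - 1):
        left_sum += ys[i]
        if xs[i] == xs[i + 1]:         # identical values: no valid split here
            continue
        threshold = 0.5 * (xs[i] + xs[i + 1])
        yield threshold, left_sum / (i + 1), (total - left_sum) / (n - i - 1)

X_col = np.array([1.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 0.0, 1.0, 1.0])
splits = list(candidate_splits(X_col, y))
# only two candidate thresholds despite four samples, because the
# repeated value 1.0 contributes a single split point
```

With many repeated values the scan tests far fewer thresholds than a naive re-scan per node, at the cost of storing and maintaining the sorted order as samples are partitioned.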

I'm currently using the GBRT classifier in a setting which involves
both numerical features and indicator features - so I would really
appreciate a performance enhancement here! On the other hand, when
speaking about performance issues we should not focus solely on
training time! In fact, currently I'm much more concerned
about test time performance which limits the applicability of sklearn
for certain scenarios (e.g. low latency scenarios such as web apps or
real-time CV or using sklearn classifiers as building blocks in greedy
decoders for NLP applications).
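For the test-time side, the usual remedy is to flatten the fitted trees into parallel arrays so that prediction is a tight loop with no per-node Python object overhead (and is trivial to port to C/Cython). A minimal sketch with a hand-built stump ensemble, purely illustrative and not scikit-learn's actual prediction code:

```python
import numpy as np

# Each "tree" here is a stump: (feature, threshold, left_value,
# right_value) stored in parallel arrays; GBRT-style prediction sums
# the shrunken stage outputs. Names and values are made up.
feature   = np.array([0, 1])
threshold = np.array([0.5, 1.5])
left_val  = np.array([-0.1, -0.2])
right_val = np.array([0.1, 0.2])

def predict(x, learning_rate=0.1, init=0.0):
    """Sum shrunken stage predictions for a single sample x."""
    score = init
    for f, t, lv, rv in zip(feature, threshold, left_val, right_val):
        score += learning_rate * (lv if x[f] <= t else rv)
    return score

predict(np.array([0.0, 2.0]))
```

The same array layout generalizes to deeper trees (children stored as index arrays), which keeps single-sample latency low enough for the web-app and real-time scenarios mentioned above.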

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
