2012/3/25 Scott White <[email protected]>:
> I have noticed also that GBRT can be quite slow even for small k.
> Since GBRT fits n*k decision trees, by design, it's crucial that the
> decision tree code be highly optimized. I did some quick performance
> profiling the other day, which showed that the bottleneck was the
> method _find_best_split.
>
> Looking into that method I found what I think is a major inefficiency
> that needs to be addressed. Nested within a for loop is a 'while True'
> loop that iterates over the splitting points for a given feature to
> test for the best split. One problem I see is that the splitting
> points are not indexed ahead of time, so a lot of time is wasted
> finding the same splitting points over and over again. By indexing
> the splitting points ahead of time, I believe the cost of finding the
> best split would be reduced from O(m) to O(u). This should give a
> marked improvement on datasets that have a lot of repeated values.
> Of course, this would come at the additional cost of storing those
> indexed splitting points.
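To make the idea concrete, here is a minimal sketch of split finding over a presorted feature - this is illustrative only, not scikit-learn's actual implementation, and all names are made up. The argsort is paid once per feature; the scan then only evaluates the O(u) positions where the feature value actually changes, instead of re-discovering split points on every pass:

```python
import numpy as np

def best_split_presorted(x, y):
    """Find the threshold on a single feature that minimizes the
    within-child sum of squared errors (MSE criterion).

    Illustrative sketch: presort once, then scan candidate splits
    in a single pass, skipping positions with tied feature values.
    """
    order = np.argsort(x)          # done once per feature, reusable
    xs, ys = x[order], y[order]

    m = len(xs)
    total = ys.sum()
    left_sum = 0.0
    best_score, best_thr = np.inf, None

    for i in range(m - 1):
        left_sum += ys[i]
        # Only positions where the value changes are valid thresholds,
        # so only O(u) candidates are scored (u = number of unique values).
        if xs[i] == xs[i + 1]:
            continue
        n_left = i + 1
        n_right = m - n_left
        right_sum = total - left_sum
        # Minimizing within-child SSE is equivalent to maximizing
        # sum_left^2/n_left + sum_right^2/n_right, hence the sign flip.
        score = -(left_sum ** 2 / n_left + right_sum ** 2 / n_right)
        if score < best_score:
            best_score = score
            best_thr = 0.5 * (xs[i] + xs[i + 1])
    return best_thr
```

With many repeated values (e.g. indicator features, where u = 2) the inner scan scores only a couple of thresholds per feature, which is where the speed-up the quoted message describes would come from.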
I totally agree - the current implementation is tailored towards data
sets with numerical features (and lots of potential split points) - the
``sample_mask`` approach indeed does poorly on data sets with
categorical features. On the other hand, the indexing approach has
significant pre-processing costs, and the cost of updating the data
structure is much higher too - so I assume there is a break-even point
somewhere, and at the end of the day it really depends on the data you
are working with. I'm currently using the GBRT classifier in a setting
which involves both numerical features and indicator features - so I
would really appreciate a performance enhancement here!

That said, when speaking about performance issues we should not focus
solely on training-time performance! In fact, I'm currently much more
concerned about test-time performance, which limits the applicability
of sklearn in certain scenarios (e.g. low-latency settings such as web
apps or real-time CV, or using sklearn classifiers as building blocks
in greedy decoders for NLP applications).

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
