I have also noticed that GBRT can be quite slow even for small k. Since GBRT fits n*k decision trees by design, it's crucial that the decision tree code be highly optimized. I did some quick performance profiling the other day, which showed that most of the fitting time was spent in the method _find_best_split.
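For anyone who wants to reproduce the profiling, a minimal sketch along these lines should surface the hotspot (the dataset and parameter values below are arbitrary placeholders, not my exact setup):

import cProfile
import pstats

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# A small synthetic multi-class problem, just to exercise the tree code.
X, y = make_classification(n_samples=1000, n_features=20,
                           n_informative=10, n_classes=5, random_state=0)

clf = GradientBoostingClassifier(n_estimators=50)

# Profile the fit and list the calls with the highest cumulative time;
# the split-finding routine should show up near the top.
profiler = cProfile.Profile()
profiler.enable()
clf.fit(X, y)
profiler.disable()
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)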
Looking into that method, I found what I think is a major inefficiency that needs to be addressed. Nested within a for loop is a 'while True' loop that iterates over the splitting points for a given feature to test for the best split. One problem I see is that the splitting points are not indexed ahead of time, so a lot of time is wasted finding the same splitting points over and over again. By allowing the splitting points to be indexed ahead of time, I believe the cost of finding the splits would be reduced from O(m) to O(u). This should give a marked improvement on datasets that have a lot of repeated values. Of course, this would come at the additional cost of storing those indexed splitting points.

Notation I am using:

n = # of iterations
k = # of classes
m = # of samples
u = max # of unique values across all features
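To make this concrete, here is a toy sketch of what I mean by indexing the split points ahead of time (illustrative only, not the actual scikit-learn tree code; the helper names are made up):

import numpy as np

def presort_split_points(X):
    # Build the split-point index once, ahead of time: for each feature,
    # store its sorted unique values. The split search can then scan at
    # most u candidates per feature instead of re-discovering them from
    # all m samples on every pass.
    return [np.unique(X[:, j]) for j in range(X.shape[1])]

def candidate_thresholds(split_points, j):
    # Midpoints between consecutive unique values of feature j are the
    # only thresholds worth testing when splitting on that feature.
    v = split_points[j]
    return (v[:-1] + v[1:]) / 2.0

The trade-off is the extra memory for the per-feature indexes, which is at most one array of u values per feature.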
This is from a cursory inspection, as I have not had as much time to study this code as I would like, so feel free to correct me if I'm wrong here.

cheers,
Scott

On Sun, Mar 25, 2012 at 3:44 AM, Peter Prettenhofer <[email protected]> wrote:
> Olivier,
>
> In my experience GBRT usually requires more base learners than random
> forests to get the same level of accuracy. I hardly use fewer than 100.
> Regarding the poor performance of GBRT on the olivetti dataset:
> multi-class GBRT fits ``k`` trees at each stage; thus, if you have
> ``n_estimators`` this means you have to grow ``k * n_estimators``
> trees in total (4000 trees is quite a lot :-) ). Personally, I haven't
> used multi-class GBRT much (part of the reason is that GBM does not
> support it) - I know that the learning-to-rank folks use multi-class
> GBRT for ordinal scaled output values (e.g. "not-relevant",
> "relevant", "highly relevant") but these usually involve fewer than 5
> classes.
>
> That said, the major drawback of GBRT is computational complexity: we
> could speed up multi-class GBRT by training the ``k`` trees at each
> stage in parallel, but still it is much less efficient than random
> forests. Maybe Scott (on CC) can comment on this as well - he has
> worked on the multi-class support in GBRT and knows much more about
> it.
>
> @Olivier: it would be great if you could send me your benchmark script
> so that I can look into the issue in more detail.
>
> thanks,
> Peter
>
> 2012/3/25 Gilles Louppe <[email protected]>:
>> Hi Olivier,
>>
>> The higher the number of estimators, the better. The more random the
>> trees (e.g., the lower max_features), the more important it usually is
>> to have a large forest to decrease the variance. To me, 10 is actually
>> a very low default value. In my daily research, I deal with hundreds
>> of trees. But yeah, it also takes longer.
>>
>> By the way, I am curious: what kind of dataset are you testing those
>> methods on? :)
>>
>> Gilles
>>
>> On 25 March 2012 03:49, Olivier Grisel <[email protected]> wrote:
>>> Hi all,
>>>
>>> I have been playing a bit with GradientBoostingClassifier,
>>> AdaBoostClassifier and ExtraTrees, and while extra trees and adaboost
>>> are reasonably fast to fit with their default params (n_estimators=10)
>>> on a non-toy dataset such as the olivetti faces dataset, the
>>> GradientBoostingClassifier was taking ages (I killed it).
>>>
>>> The current default value is n_estimators=100 for
>>> GradientBoostingClassifier. Maybe it should be aligned to
>>> n_estimators=10 as in the other ensemble methods of the scikit?
>>>
>>> Or was I doing something very stupid by naively running it with the
>>> default params on a dataset with size n_samples=400, n_features=4096
>>> and n_classes=40 without any kind of preprocessing?
>>>
>>> Another way to rephrase that question: what is the typical sweet spot
>>> for the dataset shape when doing classification with Gradient Boosted
>>> Trees? What are reasonable values for the number of estimators in
>>> various application domains?
>>>
>>> --
>>> Olivier
>>> http://twitter.com/ogrisel - http://github.com/ogrisel
>
> --
> Peter Prettenhofer
