I have also noticed that GBRT can be quite slow even for small k.
Since GBRT fits n*k decision trees by design, it's crucial that the
decision tree code be highly optimized. I did some quick performance
profiling the other day, which showed that most of the time was
spent in the method _find_best_split.

Looking into that method I found what I think is a major inefficiency
that needs to be addressed. Nested within a for loop is a 'while True'
loop that iterates over the splitting points for a given feature to
test for the best split. One problem I see is that the splitting
points are not indexed ahead of time, so a lot of time is wasted
finding the same splitting points over and over again. By indexing
the splitting points ahead of time, I believe the cost of finding
the splits would be reduced from O(m) to O(u). This should yield a
marked improvement on datasets that have a lot of repeated values.
Of course, this would come at the additional cost of storing those
indexed splitting points.

Notation I am using:
n = # iterations
k = # classes
m = # samples
u = max # of unique values across all features
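To make the idea concrete, here is a minimal sketch of what I mean by
pre-indexing (this is not the actual scikit-learn code; the helper name
candidate_splits is just for illustration):

```python
import numpy as np

def candidate_splits(X):
    """Precompute the sorted unique values of each feature once, so the
    split search scans at most u candidate thresholds per feature
    instead of re-discovering them across all m samples on every pass."""
    return [np.unique(X[:, j]) for j in range(X.shape[1])]

# Example: a single feature with many repeated values (m = 6 samples)
X = np.array([[0.0], [0.0], [1.0], [1.0], [1.0], [2.0]])
splits = candidate_splits(X)
# splits[0] -> array([0., 1., 2.]): only u = 3 candidates to test
```

The memory cost is the stored arrays of unique values, which is what I
meant by the additional storage trade-off above.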

This is from a cursory inspection, as I have not had as much time to
study this code as I would like, so feel free to correct me if I'm
wrong here.

cheers,
Scott

On Sun, Mar 25, 2012 at 3:44 AM, Peter Prettenhofer
<[email protected]> wrote:
> Olivier,
>
> In my experience GBRT usually requires more base learners than random
> forests to get the same level of accuracy. I hardly use less than 100.
> Regarding the poor performance of GBRT on the olivetti dataset:
> multi-class GBRT fits ``k`` trees at each stage, thus, if you have
> ``n_estimators`` this means you have to grow ``k * n_estimators``
> trees in total (4000 trees is quite a lot :-) ). Personally, I haven't
> used multi-class GBRT much (part of the reason is that GBM does not
> support it) - I know that the learning to rank folks use multi-class
> GBRT for ordinal scaled output values (e.g. "not-relevant",
> "relevant", "highly relevant") but these involve usually less than 5
> classes.
>
> That said, the major drawback of GBRT is computational complexity: we
> could speed up multi-class GBRT by training the ``k`` trees at each
> stage in parallel but still it is much less efficient than random
> forests. Maybe Scott (on CC) can comment on this as well - he has
> worked on the multi-class support in GBRT and knows much more about
> it.
>
> @Olivier: it would be great if you could send me your benchmark script
> so that I can look into the issue in more detail.
>
> thanks,
>  Peter
>
> 2012/3/25 Gilles Louppe <[email protected]>:
>> Hi Olivier,
>>
>> The higher the number of estimators, the better. The more random the
>> trees (e.g., the lower max_features), the more important it usually is
>> to have a large forest to decrease the variance. To me, 10 is actually
>> a very low default value. In my daily research, I deal with hundreds
>> of trees. But yeah, it also takes longer.
>>
>> By the way I am curious, what kind of dataset are you testing those
>> methods on? :)
>>
>> Gilles
>>
>> On 25 March 2012 03:49, Olivier Grisel <[email protected]> wrote:
>>> Hi all,
>>>
>>> I have been playing a bit with GradientBoostingClassifier and
>>> AdaBoostClassifier and ExtraTrees and while extra trees and adaboost
>>> are reasonably fast to fit with their default params (n_estimators=10)
>>> on a non toy dataset such as the olivetti faces dataset, the
>>> GradientBoostingClassifier was taking ages (I killed it).
>>>
>>> The current default value is n_estimators=100 for
>>> GradientBoostingClassifier. Maybe it should be aligned to
>>> n_estimators=10 as in the other ensemble methods of the scikit?
>>>
>>> Or was I doing something very stupid by naively running it with the
>>> default params on a dataset with size n_samples=400, n_features=4096
>>> and n_classes=40 without any kind of preprocessing?
>>>
>>> Another way to rephrase that question: what is the typical sweet spot
>>> for the dataset shape when doing classification Gradient Boosted
>>> Trees? What are reasonable values for the number of estimators in
>>> various application domains?
>>>
>>> --
>>> Olivier
>>> http://twitter.com/ogrisel - http://github.com/ogrisel
>>>
>>> ------------------------------------------------------------------------------
>>> This SF email is sponsored by:
>>> Try Windows Azure free for 90 days Click Here
>>> http://p.sf.net/sfu/sfd2d-msazure
>>> _______________________________________________
>>> Scikit-learn-general mailing list
>>> [email protected]
>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>
>
>
>
> --
> Peter Prettenhofer
