Btw, another thing I think we should add is the ability to monitor the
out-of-bag estimates after each iteration and allow the fitting to be
terminated early. It's usually hard to guess the right number of
iterations up front, and being able to terminate the fitting early
based on good OOB results can speed things up a lot.
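
To make that concrete, here is a rough sketch of the idea (post-hoc
for now: it picks the stopping iteration from staged predictions on a
held-out set after fitting, whereas real OOB-based stopping would have
to hook into the fitting loop itself; names and parameters below are
just for illustration):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# toy data
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_val = X[:1500], X[1500:]
y_train, y_val = y[:1500], y[1500:]

clf = GradientBoostingClassifier(n_estimators=500, subsample=0.5)
clf.fit(X_train, y_train)

# Track the held-out error after each stage and stop once it has not
# improved for `patience` consecutive stages.
patience, best_err, best_iter = 10, np.inf, 0
for i, y_pred in enumerate(clf.staged_predict(X_val)):
    err = np.mean(y_pred != y_val)
    if err < best_err:
        best_err, best_iter = err, i
    elif i - best_iter >= patience:
        break

print("would have stopped after %d stages (val error %.3f)"
      % (best_iter + 1, best_err))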

cheers,
Scott

On Sun, Mar 25, 2012 at 10:10 AM, Scott White <[email protected]> wrote:
> I have also noticed that GBRT can be quite slow even for small k.
> Since GBRT fits n*k decision trees by design, it's crucial that the
> decision tree code be highly optimized. I did some quick performance
> profiling the other day, which showed that most of the time was
> spent in the method _find_best_split.
>
> Looking into that method I found what I think is a major inefficiency
> that needs to be addressed. Nested within a for loop is a 'while True'
> that iterates over the splitting points for a given feature to test
> for the best split. One problem I see is that the splitting points are
> not indexed ahead of time, so a lot of time is wasted finding the same
> splitting points over and over again. By indexing the splitting points
> ahead of time, I believe the per-feature cost of enumerating the
> splits would be reduced from O(m) to O(u). This should make a marked
> difference on datasets that have a lot of repeated values. Of course,
> this would come at the additional cost of storing the indexed
> splitting points. A rough standalone sketch of what I mean follows
> the notation below.
>
> Notation I am using:
> n = # iterations
> k = # classes
> m = # samples
> u = max # of unique values across all features
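>
> To make the indexing idea concrete, here is a toy standalone sketch
> (plain numpy, made-up function names -- not the actual
> _find_best_split code):
>
> import numpy as np
>
> def build_split_index(X):
>     # Precompute, once per feature, the sorted unique values: these
>     # are the only places a threshold can ever go.
>     return [np.unique(X[:, j]) for j in range(X.shape[1])]
>
> def candidate_thresholds(split_index, j):
>     # Midpoints between consecutive unique values: at most u - 1
>     # candidates per feature instead of m - 1.
>     vals = split_index[j]
>     return (vals[:-1] + vals[1:]) / 2.0
>
> # toy usage
> X = np.array([[0., 1.], [0., 2.], [1., 2.], [1., 3.]])
> index = build_split_index(X)
> for j in range(X.shape[1]):
>     print(j, candidate_thresholds(index, j))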
>
> This is from a cursory inspection, as I have not had as much time to
> study this code as I would like, so feel free to correct me if I'm
> wrong here.
>
> cheers,
> Scott
>
> On Sun, Mar 25, 2012 at 3:44 AM, Peter Prettenhofer
> <[email protected]> wrote:
>> Olivier,
>>
>> In my experience GBRT usually requires more base learners than random
>> forests to get the same level of accuracy. I hardly ever use fewer
>> than 100. Regarding the poor performance of GBRT on the olivetti
>> dataset: multi-class GBRT fits ``k`` trees at each stage, so with
>> ``n_estimators`` stages you have to grow ``k * n_estimators`` trees
>> in total (4000 trees is quite a lot :-) ). Personally, I haven't used
>> multi-class GBRT much (part of the reason is that GBM does not
>> support it) - I know that the learning-to-rank folks use multi-class
>> GBRT for ordinally scaled output values (e.g. "not relevant",
>> "relevant", "highly relevant"), but those usually involve fewer than
>> 5 classes.
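>>
>> To see the k-trees-per-stage structure concretely (an illustrative
>> sketch; the exact attribute layout may differ between versions):
>>
>> from sklearn.datasets import load_digits
>> from sklearn.ensemble import GradientBoostingClassifier
>>
>> digits = load_digits()                 # 10 classes
>> X, y = digits.data, digits.target
>>
>> clf = GradientBoostingClassifier(n_estimators=25)
>> clf.fit(X, y)
>>
>> # One regression tree per class per stage, i.e. 25 * 10 = 250 trees.
>> print(clf.estimators_.shape)           # -> (25, 10)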
>>
>> That said, the major drawback of GBRT is computational complexity: we
>> could speed up multi-class GBRT by training the ``k`` trees at each
>> stage in parallel, but it would still be much less efficient than
>> random forests. Maybe Scott (on CC) can comment on this as well - he
>> has worked on the multi-class support in GBRT and knows much more
>> about it.
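>>
>> Roughly what I mean by parallelizing within a stage (just a sketch
>> with made-up helper names, using joblib directly; the real change
>> would have to live inside the gradient boosting fit loop):
>>
>> import numpy as np
>> from joblib import Parallel, delayed
>> from sklearn.tree import DecisionTreeRegressor
>>
>> def fit_class_tree(X, residual_k):
>>     # Each class's tree is fit on its own column of pseudo-residuals,
>>     # so the k fits within a stage are independent of each other.
>>     return DecisionTreeRegressor(max_depth=3).fit(X, residual_k)
>>
>> def fit_stage_parallel(X, residuals, n_jobs=2):
>>     # residuals: (n_samples, k) pseudo-residuals for the current stage
>>     return Parallel(n_jobs=n_jobs)(
>>         delayed(fit_class_tree)(X, residuals[:, k])
>>         for k in range(residuals.shape[1]))
>>
>> # toy usage with fake residuals, pretending k = 3
>> X = np.random.RandomState(0).rand(100, 5)
>> residuals = np.random.RandomState(1).rand(100, 3)
>> print(len(fit_stage_parallel(X, residuals)))   # -> 3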
>>
>> @Olivier: it would be great if you could send me your benchmark script
>> so that I can look into the issue in more detail.
>>
>> thanks,
>>  Peter
>>
>> 2012/3/25 Gilles Louppe <[email protected]>:
>>> Hi Olivier,
>>>
>>> The higher the number of estimators, the better. The more random the
>>> trees (e.g., the lower max_features), the more important it usually is
>>> to have a large forest to decrease the variance. To me, 10 is actually
>>> a very low default value. In my daily research, I deal with hundreds
>>> of trees. But yeah, it also takes longer.
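>>>
>>> Just to illustrate the trade-off (parameters picked arbitrarily):
>>> the more aggressive the feature subsampling, the more trees it
>>> typically takes for the averaging to bring the variance back down.
>>>
>>> from sklearn.ensemble import ExtraTreesClassifier
>>>
>>> very_random = ExtraTreesClassifier(n_estimators=500, max_features=1)
>>> less_random = ExtraTreesClassifier(n_estimators=100, max_features='sqrt')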
>>>
>>> By the way I am curious, what kind of dataset are you testing those
>>> methods on? :)
>>>
>>> Gilles
>>>
>>> On 25 March 2012 03:49, Olivier Grisel <[email protected]> wrote:
>>>> Hi all,
>>>>
>>>> I have been playing a bit with GradientBoostingClassifier,
>>>> AdaBoostClassifier and ExtraTrees, and while extra trees and adaboost
>>>> are reasonably fast to fit with their default params (n_estimators=10)
>>>> on a non-toy dataset such as the olivetti faces dataset,
>>>> GradientBoostingClassifier was taking ages (I killed it).
>>>>
>>>> The current default value is n_estimators=100 for
>>>> GradientBoostingClassifier. Maybe it should be aligned to
>>>> n_estimators=10 as in the other ensemble methods of the scikit?
>>>>
>>>> Or was I doing something very stupid by naively running it with the
>>>> default params on a dataset with size n_samples=400, n_features=4096
>>>> and n_classes=40 without any kind of preprocessing?
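>>>>
>>>> For reference, this is roughly the kind of script I was running (a
>>>> sketch of the setup rather than the exact benchmark, with
>>>> n_estimators cut down so that it actually finishes):
>>>>
>>>> from time import time
>>>> from sklearn.datasets import fetch_olivetti_faces
>>>> from sklearn.ensemble import ExtraTreesClassifier
>>>> from sklearn.ensemble import GradientBoostingClassifier
>>>>
>>>> faces = fetch_olivetti_faces(shuffle=True, random_state=0)
>>>> X, y = faces.data, faces.target   # 400 samples, 4096 features, 40 classes
>>>>
>>>> # With the default n_estimators=100, GBRT would grow 40 * 100 = 4000
>>>> # trees; even 10 stages already means 400 trees here.
>>>> for clf in (ExtraTreesClassifier(n_estimators=10),
>>>>             GradientBoostingClassifier(n_estimators=10)):
>>>>     t0 = time()
>>>>     clf.fit(X, y)
>>>>     print(clf.__class__.__name__, "%.1fs" % (time() - t0))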
>>>>
>>>> Another way to rephrase that question: what is the typical sweet spot
>>>> for the dataset shape when doing classification Gradient Boosted
>>>> Trees? What are reasonable values for the number of estimators in
>>>> various application domains?
>>>>
>>>> --
>>>> Olivier
>>>> http://twitter.com/ogrisel - http://github.com/ogrisel
>>>>
>>>
>>
>>
>>
>> --
>> Peter Prettenhofer
