Olivier,

In my experience GBRT usually requires more base learners than random
forests to reach the same level of accuracy; I hardly ever use fewer
than 100. Regarding the poor performance of GBRT on the Olivetti
dataset: multi-class GBRT fits ``k`` trees at each stage, so with
``n_estimators`` stages you have to grow ``k * n_estimators`` trees
in total (4000 trees is quite a lot :-) ). Personally, I haven't used
multi-class GBRT much (part of the reason is that GBM does not
support it). I know that the learning-to-rank folks use multi-class
GBRT for ordinal-scaled output values (e.g. "not-relevant",
"relevant", "highly relevant"), but these usually involve fewer than
5 classes.
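
You can actually see the ``k * n_estimators`` bookkeeping on a fitted
model: ``estimators_`` holds one regression tree per (stage, class)
pair. A minimal sketch on toy data (nothing Olivetti-specific):

    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier

    X, y = make_classification(n_samples=200, n_features=10,
                               n_informative=8, n_classes=4,
                               random_state=0)
    clf = GradientBoostingClassifier(n_estimators=25).fit(X, y)
    # one regression tree per boosting stage and per class:
    print(clf.estimators_.shape)   # (25, 4)
    print(clf.estimators_.size)    # 100 trees in total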

That said, the major drawback of GBRT is computational complexity: we
could speed up multi-class GBRT by training the ``k`` trees of each
stage in parallel, but it would still be much less efficient than
random forests. Maybe Scott (on CC) can comment on this as well - he
has worked on the multi-class support in GBRT and knows much more
about it.
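
Just to sketch that idea (this is *not* how the current
implementation is written - a hypothetical ``fit_stage_parallel``
helper): the ``k`` trees of one stage are independent given the
pseudo-residuals, so they could be farmed out, e.g. with joblib:

    from joblib import Parallel, delayed
    from sklearn.tree import DecisionTreeRegressor

    def _fit_one_class(X, residuals_k):
        # one regression tree on the pseudo-residuals of a single class
        return DecisionTreeRegressor(max_depth=3).fit(X, residuals_k)

    def fit_stage_parallel(X, residuals, n_jobs=2):
        # residuals has shape (n_samples, k); each tree only depends on
        # its own column, hence the embarrassing parallelism
        return Parallel(n_jobs=n_jobs)(
            delayed(_fit_one_class)(X, residuals[:, k])
            for k in range(residuals.shape[1]))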

@Olivier: it would be great if you could send me your benchmark script
so that I can look into the issue in more detail.
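
I imagine it was roughly something along these lines (just my guess
at the setup, not your actual script):

    from time import time
    from sklearn.datasets import fetch_olivetti_faces
    from sklearn.ensemble import (ExtraTreesClassifier,
                                  GradientBoostingClassifier)

    faces = fetch_olivetti_faces()   # n_samples=400, n_features=4096
    X, y = faces.data, faces.target  # n_classes=40

    for clf in (ExtraTreesClassifier(n_estimators=10),
                GradientBoostingClassifier(n_estimators=100)):
        t0 = time()
        clf.fit(X, y)                # the GBRT fit is the slow one:
        print(clf.__class__.__name__, time() - t0)  # 40 trees per stage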

thanks,
 Peter

2012/3/25 Gilles Louppe <[email protected]>:
> Hi Olivier,
>
> The higher the number of estimators, the better. The more random the
> trees (e.g., the lower max_features), the more important it usually is
> to have a large forest to decrease the variance. To me, 10 is actually
> a very low default value. In my daily research, I deal with hundreds
> of trees. But yeah, it also takes longer.
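>
> A quick sketch if you want to see the effect (synthetic data,
> numbers purely illustrative): with maximally random trees
> (max_features=1) each individual tree is weak, and it is the size
> of the forest that brings the variance down.
>
>     from sklearn.datasets import make_classification
>     from sklearn.ensemble import ExtraTreesClassifier
>     from sklearn.model_selection import train_test_split
>
>     X, y = make_classification(n_samples=2000, n_features=50,
>                                n_informative=10, random_state=0)
>     X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
>     for n in (10, 100, 500):
>         clf = ExtraTreesClassifier(n_estimators=n, max_features=1,
>                                    random_state=0)
>         print(n, clf.fit(X_tr, y_tr).score(X_te, y_te))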
>
> By the way I am curious, what kind of dataset are you testing those
> methods on? :)
>
> Gilles
>
> On 25 March 2012 03:49, Olivier Grisel <[email protected]> wrote:
>> Hi all,
>>
>> I have been playing a bit with GradientBoostingClassifier,
>> AdaBoostClassifier and ExtraTrees. While extra trees and AdaBoost
>> are reasonably fast to fit with their default params
>> (n_estimators=10) on a non-toy dataset such as the Olivetti faces
>> dataset, GradientBoostingClassifier was taking ages (I killed it).
>>
>> The current default value is n_estimators=100 for
>> GradientBoostingClassifier. Maybe it should be aligned with
>> n_estimators=10, as in the scikit's other ensemble methods?
>>
>> Or was I doing something very stupid by naively running it with
>> the default params on a dataset of shape n_samples=400,
>> n_features=4096 and n_classes=40, without any kind of preprocessing?
>>
>> Another way to rephrase the question: what is the typical sweet
>> spot in dataset shape for classification with Gradient Boosted
>> Trees? And what are reasonable values for the number of estimators
>> in various application domains?
>>
>> --
>> Olivier
>> http://twitter.com/ogrisel - http://github.com/ogrisel
>>



-- 
Peter Prettenhofer
