Le 25 mars 2012 12:44, Peter Prettenhofer
<[email protected]> a écrit :
> Olivier,
>
> In my experience GBRT usually requires more base learners than random
> forests to get the same level of accuracy. I hardly use less than 100.
> Regarding the poor performance of GBRT on the olivetti dataset:
> multi-class GBRT fits ``k`` trees at each stage, thus, if you have
> ``n_estimators`` this means you have to grow ``k * n_estimators``
> trees in total (4000 trees is quite a lot :-) ). Personally, I haven't
> used multi-class GBRT much (part of the reason is that GBM does not
> support it) - I know that the learning to rank folks use multi-class
> GBRT for ordinal scaled output values (e.g. "not-relevant",
> "relevant", "highly relevant") but these involve usually less than 5
> classes.

Interesting, I think these kinds of practical considerations should be
added to the docs.
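As a quick sanity check of the ``k`` trees-per-stage point, the fitted
``estimators_`` attribute holds one regression tree per class per
boosting stage, so the total tree count is ``n_estimators * n_classes``
(a minimal sketch on synthetic data; the exact shape of ``estimators_``
assumes the current array layout):

```python
# Sanity check: multi-class GBRT grows k trees per stage, so the
# total number of trees is n_estimators * n_classes.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=5, n_classes=4,
                           random_state=0)
clf = GradientBoostingClassifier(n_estimators=25).fit(X, y)

print(clf.estimators_.shape)  # (n_estimators, k) == (25, 4)
print(clf.estimators_.size)   # 100 trees grown in total
```

With ``n_estimators=100`` on the 40-class olivetti dataset this is
where the 4000 trees come from.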

> That said, the major drawback of GBRT is computational complexity: we
> could speed up multi-class GBRT by training the ``k`` trees at each
> stage in parallel but still it is much less efficient than random
> forests. Maybe Scott (on CC) can comment on this as well - he has
> worked on the multi-class support in GBRT and knows much more about
> it.
>
> @Olivier: it would be great if you could send me your benchmark script
> so that I can look into the issue in more detail.

I just started an IPython session similar to:

>>> from sklearn.ensemble import ExtraTreesClassifier, GradientBoostingClassifier
>>> from sklearn.datasets import fetch_olivetti_faces
>>> from sklearn.cross_validation import cross_val_score
>>> olivetti = fetch_olivetti_faces()
>>> cross_val_score(ExtraTreesClassifier(), olivetti.data, olivetti.target)
# should yield 3 scores around 0.85 in 10s
>>> cross_val_score(GradientBoostingClassifier(), olivetti.data,
...                 olivetti.target)
# was too long to complete on my laptop

But indeed, using GBRT on a 40-class dataset is a bad idea in light of
what you explained.

BTW: this workshop paper on the results of the Yahoo Learning to Rank
challenge compares GBRT and RF, including their computational
complexity: very interesting (I read it after sending my question to
the mailing list...):

  http://jmlr.csail.mit.edu/proceedings/papers/v14/mohan11a/mohan11a.pdf

BTW, Learning-to-Rank seems to be a very important application domain
that we do not cover well in scikit-learn. I think it would be great
to provide a dataset loader + maybe a sample feature extractor or
example script for point-wise & pair-wise setups (and maybe list-wise
too). I wonder if it would be possible to make a small dataset
excerpt or generator suitable as a short, fast-running example when
building the doc, with a CLI switch to use a larger chunk of the
Yahoo or MSLR datasets for running the example in a more realistic
setting.
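For reference, a point-wise setup can be sketched with the existing
estimators by regressing directly on graded relevance labels and
ranking documents per query by predicted score (purely illustrative:
the synthetic data and names below are hypothetical, not a real
loader):

```python
# Hypothetical point-wise learning-to-rank sketch: treat graded
# relevance (0 = not relevant ... 2 = highly relevant) as a regression
# target, then sort a query's documents by predicted score.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
n_docs, n_features = 300, 20
X = rng.randn(n_docs, n_features)    # query-document features
y = rng.randint(0, 3, size=n_docs)   # graded relevance labels

reg = GradientBoostingRegressor(n_estimators=100).fit(X, y)

# Rank the 10 documents of one (fake) query, best-scored first:
query_docs = X[:10]
order = np.argsort(reg.predict(query_docs))[::-1]
print(order)
```

A pair-wise or list-wise setup would need extra machinery (pairwise
sample construction or a ranking loss), which is exactly the part we
don't have an example for yet.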

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel

_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
