2012/3/25 Olivier Grisel <[email protected]>:
> On 25 March 2012 12:44, Peter Prettenhofer
> <[email protected]> wrote:
>> Olivier,
>>
>> In my experience GBRT usually requires more base learners than random
>> forests to get the same level of accuracy. I hardly ever use fewer
>> than 100. Regarding the poor performance of GBRT on the Olivetti
>> dataset: multi-class GBRT fits ``k`` trees at each stage, so with
>> ``n_estimators`` stages you end up growing ``k * n_estimators`` trees
>> in total (4000 trees is quite a lot :-) ). Personally, I haven't used
>> multi-class GBRT much (part of the reason is that GBM does not
>> support it) - I know that the learning-to-rank folks use multi-class
>> GBRT for ordinal-scaled output values (e.g. "not-relevant",
>> "relevant", "highly relevant"), but these usually involve fewer than
>> 5 classes.
>
> Interesting - I think these kinds of practical considerations should
> be added to the docs.
Absolutely - I'll add them immediately.

>> That said, the major drawback of GBRT is computational complexity: we
>> could speed up multi-class GBRT by training the ``k`` trees at each
>> stage in parallel, but it would still be much less efficient than
>> random forests. Maybe Scott (on CC) can comment on this as well - he
>> has worked on the multi-class support in GBRT and knows much more
>> about it.
>>
>> @Olivier: it would be great if you could send me your benchmark
>> script so that I can look into the issue in more detail.
>
> I just started an IPython session similar to:
>
> >>> from sklearn.ensemble import *
> >>> from sklearn.datasets import fetch_olivetti_faces
> >>> from sklearn.cross_validation import cross_val_score
> >>> olivetti = fetch_olivetti_faces()
> >>> cross_val_score(ExtraTreesClassifier(), olivetti.data, olivetti.target)
> # should yield 3 scores around 0.85 in 10s
> >>> cross_val_score(GradientBoostingClassifier(), olivetti.data,
> ...                 olivetti.target)
> # was too long to complete on my laptop
>
> But indeed using GBRT on a 40-class dataset is stupid in light of
> what you explained.

Unfortunately, yes. As soon as I have some spare time I'll write a PR
that uses joblib to parallelize tree induction in the case of
multi-class classification.

> BTW: this workshop paper on the results of the Yahoo Learning to Rank
> challenge compares GBRT and RF and their computational complexity:
> very interesting (I read it after sending my question on the mailing
> list...):
>
> http://jmlr.csail.mit.edu/proceedings/papers/v14/mohan11a/mohan11a.pdf

Thx - I remember the paper but I haven't read it in depth. The master's
thesis of Ananth Mohan provides some implementation details for
regression tree induction and gradient boosting - highly recommended!

> BTW Learning-to-Rank seems to be a very important application domain
> that we do not cover well in scikit-learn. I think it would be great
> to provide a dataset loader + maybe some sample feature extractor or
> example script for point-wise & pair-wise setups (and maybe list-wise
> too). I wonder if it would be possible to make a small dataset excerpt
> or generator suitable as a short example with fast execution when
> building the doc, with a CLI switch to use a larger chunk of the Yahoo
> or MSLR datasets for running the example in a more realistic setting.

Yeah - unfortunately one needs to register in order to download the
Yahoo LTRC dataset, but I think we could use LETOR as a proof-of-concept
dataset. There are a number of issues though:

1. We need to support the query id (= ``qid``) field in
``svmlight_loader``; pair-wise approaches such as RankingSVMs need this
information to form example pairs. My personal experience is that
RankingSVMs perform surprisingly poorly on learning-to-rank problems -
but they are great for binary classification problems with highly
skewed class distributions [1] (in this case we also don't need qids) -
so they would be a great addition to sklearn. Point-wise approaches
don't take query affiliation into account, so we don't need to expose
the qids there.

2. We need ranking evaluation metrics such as (N)DCG, average precision
and precision at k - but these metrics need to take the query ids into
account, thus we need to modify some of our existing metrics.
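For 2., here is a rough (and untested) sketch of what I have in mind -
``dcg_at_k``, ``ndcg_at_k`` and ``mean_ndcg`` are just placeholder
names, not existing sklearn functions; the point is only that the
metric has to group documents by ``qid`` before averaging:

    import numpy as np

    def dcg_at_k(relevances, k):
        # discounted cumulative gain for a single query (top k results)
        relevances = np.asarray(relevances, dtype=np.float64)[:k]
        gains = 2 ** relevances - 1
        discounts = np.log2(np.arange(2, relevances.size + 2))
        return np.sum(gains / discounts)

    def ndcg_at_k(y_true, y_score, k=10):
        # rank the documents of one query by predicted score and
        # normalize by the DCG of the ideal (true-relevance) ordering
        y_true = np.asarray(y_true)
        order = np.argsort(y_score)[::-1]
        ideal = dcg_at_k(np.sort(y_true)[::-1], k)
        return dcg_at_k(y_true[order], k) / ideal if ideal > 0 else 0.0

    def mean_ndcg(y_true, y_score, query_ids, k=10):
        # this is where the qids come in: group documents by query,
        # compute NDCG per query and average over the queries
        y_true, y_score = np.asarray(y_true), np.asarray(y_score)
        query_ids = np.asarray(query_ids)
        return np.mean([ndcg_at_k(y_true[query_ids == q],
                                  y_score[query_ids == q], k)
                        for q in np.unique(query_ids)])

Precision at k and average precision would follow the same pattern:
compute the score per query and then average over the queries.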
BTW: We might implement an SGD-based ranking SVM simply by creating a
new ``sklearn.utils.seq_dataset.Dataset`` subclass whose ``next``
method samples a positive and a negative example from the training set
and creates a new feature vector on the fly - that way we don't need to
create the cross-product of examples explicitly - something along the
lines of D. Sculley's work [2].

[1] D. Sculley et al., Detecting Adversarial Advertisements in the
    Wild, KDD 2011
[2] D. Sculley, Large-scale learning to rank, NIPS 2009,
    http://www.eecs.tufts.edu/~dsculley/papers/large-scale-rank.pdf
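To make the pair-sampling idea a bit more concrete, here is a rough,
untested sketch in pure Python/numpy - the real thing would be a Cython
``Dataset`` subclass that samples on the fly instead of materializing
the pairs, and ``sample_pair`` / ``fit_ranking_sgd`` are made-up names,
not existing sklearn API:

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    def sample_pair(X, y, qid, rng):
        # draw two documents of the same query with different relevance
        # and return their difference vector plus a +1/-1 label (the
        # sign is flipped at random so that both classes occur)
        qs = np.unique(qid)
        while True:
            q = qs[rng.randint(len(qs))]
            mask = qid == q
            Xq, yq = X[mask], y[mask]
            i, j = rng.randint(len(yq)), rng.randint(len(yq))
            if yq[i] != yq[j]:
                break
        if yq[i] < yq[j]:
            i, j = j, i
        sign = 1.0 if rng.rand() > 0.5 else -1.0
        return sign * (Xq[i] - Xq[j]), sign

    def fit_ranking_sgd(X, y, qid, n_pairs=100000, seed=42):
        # fit a plain hinge-loss SGD classifier on sampled difference
        # vectors instead of the full cross-product of example pairs
        rng = np.random.RandomState(seed)
        pairs, labels = zip(*[sample_pair(X, y, qid, rng)
                              for _ in range(n_pairs)])
        clf = SGDClassifier(loss="hinge", fit_intercept=False)
        clf.fit(np.array(pairs), np.array(labels))
        return clf  # clf.decision_function(X) then gives ranking scores

Sampling pairs like this is exactly what avoids building the O(n^2)
cross-product of examples explicitly.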
--
Peter Prettenhofer