2012/3/25 Olivier Grisel <[email protected]>:
> On 25 March 2012 12:44, Peter Prettenhofer
> <[email protected]> wrote:
>> Olivier,
>>
>> In my experience GBRT usually requires more base learners than random
>> forests to get the same level of accuracy. I hardly ever use fewer
>> than 100. Regarding the poor performance of GBRT on the Olivetti
>> dataset: multi-class GBRT fits ``k`` trees at each stage, so with
>> ``n_estimators`` stages you end up growing ``k * n_estimators`` trees
>> in total (4000 trees is quite a lot :-) ). Personally, I haven't used
>> multi-class GBRT much (part of the reason is that GBM does not
>> support it) - I know that the learning-to-rank folks use multi-class
>> GBRT for ordinal-scaled output values (e.g. "not-relevant",
>> "relevant", "highly relevant"), but these usually involve fewer than
>> 5 classes.
>
> Interesting - I think these kinds of practical considerations should
> be added to the docs.
Absolutely - I'll add them immediately.

>> That said, the major drawback of GBRT is computational complexity: we
>> could speed up multi-class GBRT by training the ``k`` trees at each
>> stage in parallel, but it would still be much less efficient than
>> random forests. Maybe Scott (on CC) can comment on this as well - he
>> has worked on the multi-class support in GBRT and knows much more
>> about it.
>>
>> @Olivier: it would be great if you could send me your benchmark
>> script so that I can look into the issue in more detail.
>
> I just started an IPython session similar to:
>
> >>> from sklearn.ensemble import *
> >>> from sklearn.datasets import fetch_olivetti_faces
> >>> from sklearn.cross_validation import cross_val_score
> >>> olivetti = fetch_olivetti_faces()
> >>> cross_val_score(ExtraTreesClassifier(), olivetti.data, olivetti.target)
> # should yield 3 scores around 0.85 in 10s
> >>> cross_val_score(GradientBoostingClassifier(), olivetti.data,
> ...                 olivetti.target)
> # was too long to complete on my laptop
>
> But indeed using GBRT on a 40-class dataset is stupid in light of
> what you explained.

Unfortunately, yes. As soon as I have some spare time I'll write a PR
that uses joblib to parallelize tree induction in the case of
multi-class classification.

> BTW: this workshop paper on the results of the Yahoo Learning to Rank
> challenge compares GBRT and RF and their computational complexity:
> very interesting (I read it after sending my question on the mailing
> list...):
>
> http://jmlr.csail.mit.edu/proceedings/papers/v14/mohan11a/mohan11a.pdf

Thx - I remember the paper but I haven't read it in depth. The master's
thesis of Ananth Mohan provides some implementation details for
regression tree induction and gradient boosting - highly recommended!

> BTW Learning-to-Rank seems to be a very important application domain
> that we do not cover well in scikit-learn. I think it would be great
> to provide a dataset loader + maybe some sample feature extractor or
> example script for point-wise & pair-wise setups (and maybe list-wise
> too). I wonder if it would be possible to make a small dataset excerpt
> or generator suitable as a short example with fast execution when
> building the doc, with a CLI switch to use a larger chunk of the Yahoo
> or MSLR datasets for running the example in a more realistic setting.

Yeah - unfortunately one needs to register in order to download the
Yahoo LTRC dataset, but I think we could use LETOR as a proof-of-concept
dataset. There are a number of issues though:

1. We need to support the query id (= ``qid``) field in
``svmlight_loader``; pair-wise approaches such as RankingSVMs need this
information to form example pairs. My personal experience is that
RankingSVMs perform surprisingly poorly on learning-to-rank problems -
but they are great for binary classification problems with highly
skewed class distributions [1] (in this case we also don't need qids) -
so they would be a great addition to sklearn. Point-wise approaches
don't take query affiliation into account, so we don't need to expose
the qids there.

2. We need ranking evaluation metrics such as (N)DCG, average precision
and precision at k - but these metrics need to take the query ids into
account, thus we need to modify some of our existing metrics.
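For 2., here is a rough (and untested) sketch of what I have in mind -
``dcg_at_k``, ``ndcg_at_k`` and ``mean_ndcg`` are just placeholder
names, not existing sklearn functions; the point is only that the
metric has to group documents by ``qid`` before averaging:

    import numpy as np

    def dcg_at_k(relevances, k):
        # discounted cumulative gain for a single query (top k results)
        relevances = np.asarray(relevances, dtype=np.float64)[:k]
        gains = 2 ** relevances - 1
        discounts = np.log2(np.arange(2, relevances.size + 2))
        return np.sum(gains / discounts)

    def ndcg_at_k(y_true, y_score, k=10):
        # rank the documents of one query by predicted score and
        # normalize by the DCG of the ideal (true-relevance) ordering
        y_true = np.asarray(y_true)
        order = np.argsort(y_score)[::-1]
        ideal = dcg_at_k(np.sort(y_true)[::-1], k)
        return dcg_at_k(y_true[order], k) / ideal if ideal > 0 else 0.0

    def mean_ndcg(y_true, y_score, query_ids, k=10):
        # this is where the qids come in: group documents by query,
        # compute NDCG per query and average over the queries
        y_true, y_score = np.asarray(y_true), np.asarray(y_score)
        query_ids = np.asarray(query_ids)
        return np.mean([ndcg_at_k(y_true[query_ids == q],
                                  y_score[query_ids == q], k)
                        for q in np.unique(query_ids)])

Precision at k and average precision would follow the same pattern:
compute the score per query and then average over the queries.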
BTW: We might implement an SGD-based ranking SVM simply by creating a
new ``sklearn.utils.seq_dataset.Dataset`` subclass whose ``next``
method samples a positive and a negative example from the training set
and creates a new feature vector on the fly - that way we don't need to
create the cross-product of examples explicitly - something along the
lines of D. Sculley's work [2].

[1] D. Sculley et al., Detecting Adversarial Advertisements in the
    Wild, KDD 2011
[2] D. Sculley, Large-scale learning to rank, NIPS 2009,
    http://www.eecs.tufts.edu/~dsculley/papers/large-scale-rank.pdf
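To make the pair-sampling idea a bit more concrete, here is a rough,
untested sketch in pure Python/numpy - the real thing would be a Cython
``Dataset`` subclass that samples on the fly instead of materializing
the pairs, and ``sample_pair`` / ``fit_ranking_sgd`` are made-up names,
not existing sklearn API:

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    def sample_pair(X, y, qid, rng):
        # draw two documents of the same query with different relevance
        # and return their difference vector plus a +1/-1 label (the
        # sign is flipped at random so that both classes occur)
        qs = np.unique(qid)
        while True:
            q = qs[rng.randint(len(qs))]
            mask = qid == q
            Xq, yq = X[mask], y[mask]
            i, j = rng.randint(len(yq)), rng.randint(len(yq))
            if yq[i] != yq[j]:
                break
        if yq[i] < yq[j]:
            i, j = j, i
        sign = 1.0 if rng.rand() > 0.5 else -1.0
        return sign * (Xq[i] - Xq[j]), sign

    def fit_ranking_sgd(X, y, qid, n_pairs=100000, seed=42):
        # fit a plain hinge-loss SGD classifier on sampled difference
        # vectors instead of the full cross-product of example pairs
        rng = np.random.RandomState(seed)
        pairs, labels = zip(*[sample_pair(X, y, qid, rng)
                              for _ in range(n_pairs)])
        clf = SGDClassifier(loss="hinge", fit_intercept=False)
        clf.fit(np.array(pairs), np.array(labels))
        return clf  # clf.decision_function(X) then gives ranking scores

Sampling pairs like this is exactly what avoids building the O(n^2)
cross-product of examples explicitly.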
--
Peter Prettenhofer