2013/1/13 Erik Bernhardsson <[email protected]>:
> Just a quick question about the gradient boosting in scikit-learn. We have
> tons of data to regress on (like 100M data points), but the running time of
> the algorithm is linear in the size of X no matter what subsample is set to.
> Right now we just sample, say, 100k data points and run gradient boosting on
> them, but it would be nice if we could use a much larger data set.
>
> See
> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/gradient_boosting.py#L587
> for the code -- basically, instead of subsampling, the algorithm just
> creates a random binary mask.
>
> It would be nice if it were linear in len(X) * subsample, because then we
> could set subsample to a very small number and use a lot more data points.
> That should reduce overfitting with no real disadvantages (afaik). I'm new
> to gradient boosting and don't know it that well. Is there a fundamental
> reason why you can't make it linear in len(X) * subsample? Otherwise I
> might try to put together a patch for it.
>
> Thanks!

Hi Erik,

the problem pertains not to gradient boosting but to our (current) decision
tree implementation. We use a bit mask (aka "sample_mask") to represent
partitions of X. As you said, the algorithm is indeed linear in len(X), but
it only considers rows of X for which sample_mask is True [1] -- so
``subsample == 0.5`` should run faster than ``subsample == 1.0``, but it is
slower than passing X_subsample = X[np.random.rand(len(X)) > 0.5] directly
to the fit method.

When the sample_mask gets too sparse (i.e. too many entries are False), the
algorithm spends most of its time checking the sample_mask, which is not
very efficient. Hence, we use a heuristic: once the sample_mask gets too
sparse (see the ``min_density`` parameter), we copy X and discard all rows
where the sample_mask is False [2]. This, however, incurs both memory and
runtime costs, which have to be amortized. Since trees in gradient boosting
are usually shallow, I decided to turn off this heuristic (see [3]) --
please try setting ``self.min_density = 0.1`` and check whether you get a
performance increase. Note that if ``subsample`` is smaller than
``min_density``, each tree will trigger a copy of X.
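To make the difference concrete, here is a minimal sketch of the variants
above (X, y and all parameter values are stand-ins, not recommendations;
note that slicing X up front draws one fixed subsample for all stages,
whereas ``subsample`` redraws a fresh one per stage, and the min_density
tweak is an assumption based on [3] -- it only has an effect if fit does
not overwrite the attribute, otherwise change the value at the line
referenced in [3] directly):

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    X = np.random.rand(100000, 10)  # stand-in for the real data
    y = np.random.rand(100000)

    # Variant 1: per-stage subsampling inside the estimator. Each stage
    # draws a fresh random half of the rows, but tree building still scans
    # a len(X)-sized sample_mask, so runtime stays linear in len(X).
    est = GradientBoostingRegressor(n_estimators=100, subsample=0.5)
    est.fit(X, y)

    # Variant 2: shrink X up front and fit on the copy. Faster, but every
    # stage now sees the same fixed subsample instead of a fresh one.
    mask = np.random.rand(len(X)) > 0.5
    est = GradientBoostingRegressor(n_estimators=100)
    est.fit(X[mask], y[mask])

    # min_density tweak (assumption, see the caveat above): re-enable the
    # copy-on-sparse heuristic so very sparse masks trigger a copy of X.
    est = GradientBoostingRegressor(n_estimators=100, subsample=0.2)
    est.min_density = 0.1
    est.fit(X, y)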
We (Brian, Gilles, Andy, and I) are not totally happy with our current
sample_mask-based tree implementation -- personally, I think it can be sped
up considerably -- but removing the sample_mask would require a complete
re-write of the tree building procedure. The crux is to represent
partitions efficiently while keeping auxiliary data structures in sync
(i.e. X_argsorted, a data structure that holds, for each feature, the list
of examples sorted by ascending feature value). We have discussed various
approaches to get rid of the sample_mask in issue [4].

If you want to leverage the whole dataset (100M) you might want to explore
a different approach as well: take a subsample (100k) and train a GBRT
(e.g. 1000 trees) on it; then use this GBRT as a non-linear feature
detector and augment each of the 100M examples with 1000 new features, one
per tree in the model, given by that tree's output. Now you can feed the
new dataset into a linear model that scales to datasets of this size (e.g.
vowpal wabbit).
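A rough sketch of that pipeline (scaled-down sizes; in your scenario the
subsample would be 100k out of 100M and the ensemble 1000 trees; it assumes
the entries of ``estimators_`` expose a predict method, which holds for the
per-stage regression trees, and uses SGDRegressor as an in-library stand-in
for an external learner like vowpal wabbit):

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.linear_model import SGDRegressor

    X_big = np.random.rand(100000, 10)  # stand-in for the 100M rows
    y_big = np.random.rand(100000)

    # 1) Train a GBRT on a small random subsample.
    idx = np.random.permutation(len(X_big))[:10000]
    gbrt = GradientBoostingRegressor(n_estimators=100, max_depth=3)
    gbrt.fit(X_big[idx], y_big[idx])

    # 2) Use each tree as a non-linear feature detector: one new feature
    #    per tree, given by that tree's prediction on the full dataset.
    def gbrt_features(gbrt, X):
        return np.column_stack([tree.predict(X)
                                for tree in gbrt.estimators_.ravel()])

    # 3) Feed the augmented dataset into a linear model that can scale to
    #    the full data (here SGDRegressor; vowpal wabbit in practice).
    X_aug = np.hstack([X_big, gbrt_features(gbrt, X_big)])
    lin = SGDRegressor()
    lin.fit(X_aug, y_big)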
best,
 Peter

[1] see the function _smallest_sample_larger_than:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx#L1826
[2] https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx#L511
[3] https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/gradient_boosting.py#L563
[4] https://github.com/scikit-learn/scikit-learn/issues/964 (closed;
discussion continues in
https://github.com/scikit-learn/scikit-learn/issues/1435)

--
Peter Prettenhofer
