2013/1/13 Erik Bernhardsson <[email protected]>:
> Just a quick question about the gradient boosting in scikit-learn. We have
> tons of data to regress on (like 100M data points), but the running time of
> the algorithm is linear in the size of X no matter what subsample is set to.
> Right now we just sample, say, 100k data points and run gradient boosting on
> them, but it would be nice if we could use a much larger data set.
>
> See
> https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/gradient_boosting.py#L587
> for the code -- basically, instead of subsampling, the algorithm just
> creates a random binary mask.
>
> It would be nice if it were linear in len(X) * subsample, because then we
> could set subsample to a very small number and use a lot more data points.
> That should reduce overfitting with no real disadvantages (afaik). I'm new
> to gradient boosting and don't know it that well. Is there a fundamental
> reason why you can't make it linear in len(X) * subsample? Otherwise I
> might try to put together a patch for it.
>
> Thanks!

Hi Erik,

the problem pertains not to gradient boosting but to our (current) decision
tree implementation. We use a bit mask (aka "sample_mask") to represent
partitions of X. As you said, the algorithm is indeed linear in len(X), but
it only considers rows of X for which sample_mask is True [1] -- so
``subsample == 0.5`` should run faster than ``subsample == 1.0``, but it is
slower than passing X_subsample = X[np.random.rand(len(X)) > 0.5] directly
to the fit method.

When the sample_mask gets too sparse (i.e. too many entries are False), the
algorithm spends most of its time checking the sample_mask, which is not
very efficient. Hence, we use a heuristic: once the sample_mask gets too
sparse (see the ``min_density`` parameter), we copy X and discard all rows
where the sample_mask is False [2]. This, however, incurs both memory and
runtime costs, which have to be amortized. Since trees in gradient boosting
are usually shallow, I decided to turn off this heuristic (see [3]) --
please try setting ``self.min_density = 0.1`` and check whether you get a
performance increase. Note that if ``subsample`` is smaller than
``min_density``, each tree will trigger a copy of X.
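To make the difference concrete, here is a minimal sketch of the variants
above (X, y and all parameter values are stand-ins, not recommendations;
note that slicing X up front draws one fixed subsample for all stages,
whereas ``subsample`` redraws a fresh one per stage, and the min_density
tweak is an assumption based on [3] -- it only has an effect if fit does
not overwrite the attribute, otherwise change the value at the line
referenced in [3] directly):

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor

    X = np.random.rand(100000, 10)  # stand-in for the real data
    y = np.random.rand(100000)

    # Variant 1: per-stage subsampling inside the estimator. Each stage
    # draws a fresh random half of the rows, but tree building still scans
    # a len(X)-sized sample_mask, so runtime stays linear in len(X).
    est = GradientBoostingRegressor(n_estimators=100, subsample=0.5)
    est.fit(X, y)

    # Variant 2: shrink X up front and fit on the copy. Faster, but every
    # stage now sees the same fixed subsample instead of a fresh one.
    mask = np.random.rand(len(X)) > 0.5
    est = GradientBoostingRegressor(n_estimators=100)
    est.fit(X[mask], y[mask])

    # min_density tweak (assumption, see the caveat above): re-enable the
    # copy-on-sparse heuristic so very sparse masks trigger a copy of X.
    est = GradientBoostingRegressor(n_estimators=100, subsample=0.2)
    est.min_density = 0.1
    est.fit(X, y)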
We (Brian, Gilles, Andy, and I) are not totally happy with our current
sample_mask-based tree implementation -- personally, I think it can be sped
up considerably -- but removing the sample_mask would require a complete
re-write of the tree building procedure. The crux is to represent
partitions efficiently while keeping auxiliary data structures in sync
(i.e. X_argsorted, a data structure that holds, for each feature, the list
of examples sorted by ascending feature value). We have discussed various
approaches to get rid of the sample_mask in issue [4].

If you want to leverage the whole dataset (100M) you might want to explore
a different approach as well: take a subsample (100k) and train a GBRT
(e.g. 1000 trees) on it; then use this GBRT as a non-linear feature
detector and augment each of the 100M examples with 1000 new features, one
per tree in the model, given by that tree's output. Now you can feed the
new dataset into a linear model that scales to datasets of this size (e.g.
vowpal wabbit).
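A rough sketch of that pipeline (scaled-down sizes; in your scenario the
subsample would be 100k out of 100M and the ensemble 1000 trees; it assumes
the entries of ``estimators_`` expose a predict method, which holds for the
per-stage regression trees, and uses SGDRegressor as an in-library stand-in
for an external learner like vowpal wabbit):

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.linear_model import SGDRegressor

    X_big = np.random.rand(100000, 10)  # stand-in for the 100M rows
    y_big = np.random.rand(100000)

    # 1) Train a GBRT on a small random subsample.
    idx = np.random.permutation(len(X_big))[:10000]
    gbrt = GradientBoostingRegressor(n_estimators=100, max_depth=3)
    gbrt.fit(X_big[idx], y_big[idx])

    # 2) Use each tree as a non-linear feature detector: one new feature
    #    per tree, given by that tree's prediction on the full dataset.
    def gbrt_features(gbrt, X):
        return np.column_stack([tree.predict(X)
                                for tree in gbrt.estimators_.ravel()])

    # 3) Feed the augmented dataset into a linear model that can scale to
    #    the full data (here SGDRegressor; vowpal wabbit in practice).
    X_aug = np.hstack([X_big, gbrt_features(gbrt, X_big)])
    lin = SGDRegressor()
    lin.fit(X_aug, y_big)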
best,
 Peter

[1] see the function _smallest_sample_larger_than:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx#L1826
[2] https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_tree.pyx#L511
[3] https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/gradient_boosting.py#L563
[4] https://github.com/scikit-learn/scikit-learn/issues/964 (closed;
discussion continues in
https://github.com/scikit-learn/scikit-learn/issues/1435)

--
Peter Prettenhofer
