Just a quick question about gradient boosting in scikit-learn. We have
tons of data to regress on (around 100M data points), but the running time of
the algorithm is linear in the size of X no matter what subsample is set
to. Right now we just sample, say, 100k data points and run gradient boosting
on that, but it would be nice if we could use a much larger data set.
See
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/ensemble/gradient_boosting.py#L587
for the code – basically, instead of subsampling, the algorithm just
creates a random binary mask.
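
For concreteness, here is roughly how I understand that step (a minimal
numpy sketch of the idea only; the actual implementation lives in Cython
helpers and the names here are mine, so treat it as an approximation):

import numpy as np

def random_sample_mask(n_total, n_in_bag, rng):
    # Boolean mask of length n_total with exactly n_in_bag True entries.
    # Building it (and any later pass that touches every row of X) costs
    # O(n_total), no matter how small the subsample fraction is.
    mask = np.zeros(n_total, dtype=bool)
    mask[rng.choice(n_total, size=n_in_bag, replace=False)] = True
    return mask

rng = np.random.RandomState(0)
mask = random_sample_mask(1000000, 10000, rng)  # cost scales with 1e6, not 1e4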
It would be nice if it were linear in len(X) * subsample, because then we
could set subsample to a very small number and use a lot more data points.
That should reduce overfitting with no real disadvantages (afaik). I'm
new to gradient boosting, so I don't know it that well. Is there a
fundamental reason why you can't make it linear in len(X) * subsample?
Otherwise I might try to put together a patch for it.
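Something along these lines is what I had in mind (hypothetical names,
just a sketch of the proposal, not a claim about how the estimator works
internally today):

import numpy as np

def draw_in_bag(X, y, subsample, rng):
    # Hypothetical helper (not actual scikit-learn code): draw the in-bag
    # indices directly and return only that slice, so each tree is fit on
    # roughly len(X) * subsample rows instead of all of X.
    n_in_bag = max(1, int(subsample * X.shape[0]))
    idx = rng.choice(X.shape[0], size=n_in_bag, replace=False)
    return X[idx], y[idx]

rng = np.random.RandomState(0)
X = np.random.rand(100000, 5)
y = np.random.rand(100000)
X_bag, y_bag = draw_in_bag(X, y, 0.001, rng)  # ~100 rows handed to the tree

With subsample that small on 100M points, each tree would only ever see
on the order of 100k rows, which is the regime we care about.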
Thanks!