2014-08-18 18:28 GMT+02:00  <[email protected]>:
>
>
>
> On Mon, Aug 18, 2014 at 12:15 PM, Olivier Grisel <[email protected]>
> wrote:
>>
>> On 18 August 2014 at 16:16, "Sebastian Raschka" <[email protected]>
>> wrote:
>>
>>
>> >
>> >
>> > On Aug 18, 2014, at 3:46 AM, Olivier Grisel <[email protected]>
>> > wrote:
>> >
>> > > But the sklearn.cross_validation.Bootstrap currently implemented in
>> > > sklearn is a cross-validation iterator, not a generic resampling
>> > > method to estimate variance or confidence intervals. Don't be misled
>> > > by the name. If we chose to deprecate and then remove this class,
>> > > it's precisely because it causes confusion.
>> >
>> > Hm, I can kind of see why the Bootstrap class was initially put into
>> > sklearn.cross_validation; technically, the "approaches" (cross-validation,
>> > bootstrap, jackknife) are very related. The only difference is that you
>> > have sampling "with replacement" in the bootstrap approach and that you
>> > would typically want to have >1000 iterations.
>>
>> > So, the suggestion would be to remove Bootstrap and use
>> > sklearn.utils.resample in the future?
>>
>> Well, it depends on what you want to use bootstrapping for. If it's for
>> model evaluation (estimation of some validation score), then the
>> recommended way is to use ShuffleSplit or StratifiedShuffleSplit instead.
>> If you want generic bootstrap estimation features such as confidence
>> interval estimation (which does not exist in scikit-learn, by the way),
>> then I would recommend you have a look at scikits.bootstrap [1], which
>> also implements bias correction for skewed distributions, which is
>> non-trivial to do manually.
>>
>> [1] https://scikits.appspot.com/bootstrap
>>
>> sklearn.utils is meant only for internal use in the scikit-learn project.
>> For instance sklearn.utils.resample is useful to implement resampling
>> internally in bagging models if I remember correctly.
>>
>> > I would say that it is good that Bootstrap is implemented like a CV
>> > object,
>>
>> I think precisely the opposite. There is no point in using sampling with
>> replacement vs. sampling without replacement to estimate the validation
>> error of a model. Traditional strategies for cross-validation, as
>> implemented in ShuffleSplit, are as flexible and simpler to interpret
>> than our weird Bootstrap cross-validation iterator.
>>
>> See also: http://youtu.be/BzHz0J9a6k0?t=9m38s
>>
>> > since it would make the "estimate" and "error" calculation more
>> > convenient, right?
>>
>> I don't understand what you mean by "estimate" and "error". The model
>> parameters, its individual predictions, and its cross-validation scores
>> or errors can all be called "estimates": anything that is derived from
>> sampled data points is an estimate.
>
>
> Just a remark from the sidelines:
> (I hope to get bootstrap and cross-validation iterators into the next
> version of statsmodels, borrowing some of the ideas and code from
> scikit-learn, but the emphasis in statsmodels will be on bootstrap and
> permutation iterators.)
>
> What I think sklearn doesn't have is early stopping with randomized
> selection for cross-validation iterators. If LOO/jackknife is expensive
> to calculate for all LOO sets, can you randomly select among the LOO
> sets, or similarly for other iterators?

No, but that would be a good idea for ShuffleSplit as well. If I
understand correctly, you would pass something like a tolerance
parameter (e.g. "I want a validation score precise to 2 decimals")
and use as few iterations as possible to reach that precision, then
stop sampling. Is that right?
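For illustration, here is a minimal sketch of that stopping rule (all names are hypothetical, nothing here is scikit-learn API): keep drawing random splits and scoring them until the standard error of the mean validation score falls below the requested tolerance.

```python
import random
import statistics

def adaptive_scores(score_one_split, tol=0.01, min_iter=5, max_iter=1000, seed=0):
    """Draw random train/test splits until the standard error of the
    mean validation score falls below `tol`, then stop sampling.
    `score_one_split` is any callable returning one validation score
    (a hypothetical stand-in for fitting and scoring a model on one
    ShuffleSplit iteration)."""
    rng = random.Random(seed)
    scores = []
    for i in range(max_iter):
        scores.append(score_one_split(rng))
        if i + 1 >= min_iter:
            # Standard error of the mean: stdev / sqrt(n)
            se = statistics.stdev(scores) / len(scores) ** 0.5
            if se < tol:
                break
    return scores

# Toy score function: noisy accuracy around 0.8, standing in for one
# real fit + score on a random split.
scores = adaptive_scores(lambda rng: 0.8 + rng.gauss(0, 0.05), tol=0.01)
print(len(scores), round(statistics.mean(scores), 2))
```

With a noisier score function (or a tighter tolerance) the loop simply runs more iterations before the standard error drops below `tol`.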

> Similarly, permutation inference is often difficult because the set of
> permutations gets too large; bootstrap is then the usual alternative
> for larger samples.
>
> (I may be incorrect since I only briefly looked at the changes to your
> cross-validation.)

One thing to keep in mind is that sklearn.cross_validation.Bootstrap
is not the real bootstrap: it's a random permutation + split + random
sampling with replacement on both sides of the split independently:

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cross_validation.py#L718

This two-step procedure is done to make sure that no test sample is
part of the training fold at any iteration. A more natural way to
respect that constraint would be to sample with replacement from the
full dataset and then use the out-of-bag samples as the validation set.
But then you would lose control over the size of the test fold. This
second strategy is more like the real bootstrap and is the one I
should have implemented initially instead of the weird beast that
sklearn.cross_validation.Bootstrap currently is.
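A minimal sketch of that out-of-bag strategy (a hypothetical helper, not part of scikit-learn): sample the training indices with replacement from the full dataset and validate on whatever was left out-of-bag.

```python
import random

def oob_bootstrap_splits(n_samples, n_iter=3, seed=0):
    """Yield (train, test) index lists: the training fold is sampled
    with replacement from the full dataset, and the out-of-bag indices
    form the validation fold. The test fold size is not fixed: on
    average about n_samples / e indices end up out-of-bag."""
    rng = random.Random(seed)
    for _ in range(n_iter):
        train = [rng.randrange(n_samples) for _ in range(n_samples)]
        oob = sorted(set(range(n_samples)) - set(train))
        yield train, oob

for train, test in oob_bootstrap_splits(10):
    print(len(train), len(test))
```

Note the varying test-fold length across iterations, which is exactly the loss of control mentioned above.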

-- 
Olivier

------------------------------------------------------------------------------
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
