On Aug 18, 2014, at 16:16, "Sebastian Raschka" <[email protected]> wrote:
>
>
> On Aug 18, 2014, at 3:46 AM, Olivier Grisel <[email protected]>
wrote:
>
> > But the sklearn.cross_validation.Bootstrap currently implemented in
sklearn is a cross validation iterator, not a generic resampling method to
estimate variance or confidence intervals. Don't be misled by the name. If
we chose to deprecate and then remove this class, it's precisely because it
causes confusion.
>
> Hm, I can kind of see why the Bootstrap class was initially put into
sklearn.cross_validation; technically, the approaches (cross validation,
bootstrap, jackknife) are very related. The only difference is that you
have sampling "with replacement" in the bootstrap approach and that you
would typically want to have >1000 iterations.
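To make the distinction concrete, here is a minimal sketch using only the Python standard library (a toy dataset of 10 indices, not sklearn's actual implementation) contrasting sampling with replacement (bootstrap) against the without-replacement shuffle-and-split used by CV iterators:

```python
import random

random.seed(0)
data = list(range(10))  # toy dataset of 10 sample indices

# Bootstrap: draw n points *with* replacement -- duplicates are expected,
# and on average about 1/e (~36.8%) of the points end up "out-of-bag".
boot = [random.choice(data) for _ in data]
oob = [x for x in data if x not in boot]

# Shuffle-and-split CV: draw *without* replacement, so the train and
# test sets are disjoint and contain no duplicates.
shuffled = random.sample(data, len(data))
train, test = shuffled[:7], shuffled[7:]
```

The bootstrap sample has the same size as the original data but generally repeats some points, which is exactly what makes it awkward as a train/test splitter.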

> So, the suggestion would be to remove Bootstrap and use
sklearn.utils.resample in future?

Well, it depends on what you want to use bootstrapping for. If it's for model
evaluation (estimation of some validation score), then the recommended way
is to use ShuffleSplit or StratifiedShuffleSplit instead. If you want
generic bootstrap estimation features such as confidence interval
estimation (which does not exist in scikit-learn, by the way), then I would
recommend you have a look at scikits.bootstrap [1], which also implements
bias correction for skewed distributions, which is non-trivial to do manually.

[1] https://scikits.appspot.com/bootstrap
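To illustrate the kind of feature scikits.bootstrap provides, here is a plain percentile-interval sketch using only the standard library. Note this does *not* do the bias correction mentioned above (the function name and parameters here are made up for illustration):

```python
import random
import statistics

def bootstrap_ci(sample, stat=statistics.mean, n_boot=2000, alpha=0.05, seed=0):
    """Plain percentile bootstrap confidence interval (no bias
    correction, unlike the corrected intervals in scikits.bootstrap)."""
    rng = random.Random(seed)
    # Recompute the statistic on n_boot resamples drawn with replacement.
    estimates = sorted(
        stat([rng.choice(sample) for _ in sample]) for _ in range(n_boot)
    )
    lo = estimates[int(n_boot * alpha / 2)]
    hi = estimates[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi

lo, hi = bootstrap_ci([2.1, 2.4, 1.9, 2.8, 2.2, 2.5, 2.0, 2.3])
```

For skewed statistics the plain percentile interval can be noticeably off-center, which is why the bias correction in a dedicated library is worth having.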

sklearn.utils is meant only for internal use in the scikit-learn project.
For instance, sklearn.utils.resample is used to implement resampling
internally in bagging models, if I remember correctly.
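That internal use can be sketched roughly as follows: a bagging ensemble draws one bootstrap sample of indices per base estimator. The helper below is hypothetical, standard-library-only, and not sklearn's actual code:

```python
import random

def bootstrap_indices(n_samples, n_estimators, seed=0):
    """Hypothetical sketch: one index sample drawn with replacement
    per base estimator, as a bagging ensemble might do internally."""
    rng = random.Random(seed)
    return [
        [rng.randrange(n_samples) for _ in range(n_samples)]
        for _ in range(n_estimators)
    ]

# Three base estimators, each fit on its own bootstrap sample of 8 points.
samples = bootstrap_indices(n_samples=8, n_estimators=3)
```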

> I would say that it is good that the Bootstrap is implemented like a CV
object,

I think precisely the opposite. There is no point in using sampling with
replacement vs sampling without replacement to estimate the validation
error of a model. Traditional strategies for cross-validation, as
implemented in ShuffleSplit, are as flexible as, and simpler to interpret
than, our weird Bootstrap cross-validation iterator.

See also: http://youtu.be/BzHz0J9a6k0?t=9m38s

> since it would make the "estimate" and "error" calculation more
convenient, right?

I don't understand what you mean by "estimate" and "error". The model
parameters, its individual predictions, and its cross-validation scores or
errors can all be called "estimates": anything that is derived from sampled
data points is an estimate.

-- 

Olivier
------------------------------------------------------------------------------
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
