2014-08-18 20:44 GMT+02:00 Sebastian Raschka <[email protected]>:
>
> On Aug 18, 2014, at 12:15 PM, Olivier Grisel <[email protected]>
> wrote:
>
>> since it would make the "estimate" and "error" calculation more
>> convenient, right?
>
> I don't understand what you mean by "estimate" and "error". Both the
> model parameters, its individual predictions and its cross-validation
> scores or errors can be called "estimates": anything that is derived
> from sampled data points is an estimate.
>
> For example, the calculation of the mean accuracy from all iterations,
> and the calculation of the standard deviation/error of the mean (just
> like in regular Kfold cross-validation).
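(For concreteness, I take this to mean something like the following
sketch: the mean of the per-fold accuracies returned by cross_val_score,
together with the standard error of that mean.)

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.cross_validation import cross_val_score

    iris = load_iris()
    # One accuracy score per validation fold:
    scores = cross_val_score(LogisticRegression(), iris.data, iris.target,
                             cv=5)
    mean_score = scores.mean()
    # Standard error of the mean across the 5 fold scores:
    sem = scores.std(ddof=1) / np.sqrt(len(scores))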
Well this is not what sklearn.cross_validation.Bootstrap is doing. It's
doing some weird cross-validation splits that I made up a couple of years
ago (and that I now regret deeply) and that nobody uses in the
literature. Again, read its docstring and have a look at the source code:

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cross_validation.py#L718

Nowhere will you see an estimate of the standard deviation of the
validation score, nor of the standard error of the mean validation score
across folds. And the KFold cross-validation iterator in sklearn does not
compute the standard error of the mean score itself either: the
cross_val_score function with cv=KFold(5) returns the score computed on
each validation fold.

It would be interesting to estimate the standard deviation of the
validation score (or better, a 95% confidence interval for it), but:

- this is not what sklearn.cross_validation.Bootstrap is doing: it just
  computes CV folds, like all the other iterators in the
  sklearn.cross_validation module;

- estimating the standard error of the mean of 5 points (for 5-fold CV,
  for instance) with a bootstrapping procedure is prone to give bad
  results. Empirically I have found that bootstrapping works fine for
  estimating confidence intervals with *at least* 50 samples (and
  thousands of bootstrap iterations).

Therefore, to obtain good confidence intervals on CV scores, the right
approach (in my opinion) would be to:

1- have some kind of cross_val_predictions function that would return the
   individual predictions for each sample in any of the validation folds
   of a CV procedure, instead of the score on each fold as our
   cross_val_score function does;

2- use a bootstrapping procedure that re-samples many times with
   replacement from those predictions, so as to compute a bootstrapped
   distribution of the validation score;

3- take a confidence interval on that bootstrapped distribution of the
   validation score (a rough sketch follows below).

Furthermore, as typical scoring functions are censored (for instance the
accuracy score is bounded by 0 and 1), it is very likely that the
bootstrapped distribution of the validation score will be skewed (for
instance, a validation accuracy score distribution could have a 95%
confidence interval between 0.94 and 1.00 with a mean at 0.99). For
skewed distributions a naive percentile interval is typically wrong
because of the bias introduced by the skewness. In that case the bias can
be corrected by using the bias-corrected and accelerated (BCa)
non-parametric bootstrap procedure as implemented in scikits.bootstrap:

https://github.com/cgevans/scikits-bootstrap/blob/master/scikits/bootstrap/bootstrap.py#L70

Having BCa bootstrap confidence intervals in scipy.stats would certainly
make it simpler to implement this kind of feature in scikit-learn.

But again, what I just described here is completely different from what
we have in the sklearn.cross_validation.Bootstrap class. The
sklearn.cross_validation.Bootstrap class cannot be changed to implement
this, as it does not even have the right API to do so. It would have to
be an entirely new function or class.
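To make steps 1-3 concrete, here is a rough sketch of what I have in
mind. Note that cross_val_predictions does not exist in scikit-learn
today (it is the hypothetical function from step 1), bootstrap_score_ci
is just an illustrative helper name, and the interval computed below is
the naive percentile one that BCa would refine:

    import numpy as np
    from sklearn.base import clone
    from sklearn.cross_validation import KFold

    def cross_val_predictions(estimator, X, y, cv):
        # Step 1: collect the out-of-fold prediction for every sample.
        y_pred = np.empty_like(y)
        for train, test in cv:
            y_pred[test] = clone(estimator).fit(
                X[train], y[train]).predict(X[test])
        return y_pred

    def bootstrap_score_ci(y_true, y_pred, score_func, alpha=0.05,
                           n_iter=10000, random_state=0):
        # Steps 2 and 3: re-sample the predictions with replacement and
        # take a (naive) percentile interval of the score distribution.
        rng = np.random.RandomState(random_state)
        n = len(y_true)
        scores = np.empty(n_iter)
        for i in range(n_iter):
            idx = rng.randint(0, n, n)
            scores[i] = score_func(y_true[idx], y_pred[idx])
        return np.percentile(scores,
                             [100 * alpha / 2., 100 * (1 - alpha / 2.)])

For instance, with 5-fold CV on iris:

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    iris = load_iris()
    cv = KFold(len(iris.target), n_folds=5, shuffle=True, random_state=0)
    y_pred = cross_val_predictions(LogisticRegression(), iris.data,
                                   iris.target, cv)
    low, high = bootstrap_score_ci(iris.target, y_pred, accuracy_score)

Because the resampling is done over all n out-of-fold predictions rather
than over 5 fold scores, the bootstrap has enough samples to be
meaningful.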
> I have to agree that there are probably better approaches and
> techniques, as you mentioned, but I wouldn't remove it just because
> very few people use it in practice.

We are not removing the sklearn.cross_validation.Bootstrap class because
few people use it, but because too many people are using something that
is non-standard (I made it up) and that is very likely not what they
expect if they just read its name. At best it causes confusion when our
users read the docstring and/or the source code. At worst it causes
silent modeling errors in our users' code bases.

-- 
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
