I have an ipynb where I did the nested CV more "manually" in sklearn 0.17 vs. sklearn 0.18. I intended to add it as an appendix to a blog article (model eval part 4), which I haven't had a chance to write yet. Maybe the sklearn 0.17 part is a bit more obvious (although far less elegant) than the sklearn 0.18 version, and is helpful in some way for seeing what's going on: https://github.com/rasbt/pattern_classification/blob/master/data_viz/model-evaluation-articles/nested_cv_code.ipynb (I haven't had a chance to add comments yet, though).
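For anyone who can't open the notebook: the "manual" version essentially spells out the two loops explicitly. A minimal sketch of the idea (not the exact notebook code; it uses the current model_selection API, and the SVC/parameter grid here are just illustrative):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
p_grid = {"C": [1, 10, 100]}

outer_cv = KFold(n_splits=4, shuffle=True, random_state=0)
outer_scores = []
for train_idx, test_idx in outer_cv.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # Inner loop: hyperparameter search sees only the outer training fold
    inner_cv = KFold(n_splits=3, shuffle=True, random_state=1)
    search = GridSearchCV(SVC(kernel="rbf"), p_grid, cv=inner_cv)
    search.fit(X_train, y_train)
    # Outer loop: the held-out fold is used only to estimate
    # generalisation error of the tuned model
    outer_scores.append(search.score(X_test, y_test))

print(np.mean(outer_scores))
```

Making the two loops explicit like this shows where the data separation happens: the test fold of the outer split is never touched by the grid search.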
Btw. does anyone have a good (research article) reference for nested CV? I see people often referring to Dietterich [1], who mentions 5x2 CV. However, I think his 5x2 CV approach is different from the "nested cross-validation" that is commonly used, since the 5x2 example is just 2-fold CV repeated 5 times (10 estimates). Maybe Varma & Simon [2] would be a better reference? However, they seem to hold out only 1 test sample in the outer fold. Does anyone know of a nice empirical study on nested CV (sth. like Ron Kohavi's for k-fold CV)?

[1] Dietterich, Thomas G. 1998. "Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms." Neural Computation 10 (7): 1895–1923. doi:10.1162/089976698300017197.
[2] Varma, Sudhir, and Richard Simon. 2006. "Bias in Error Estimation When Using Cross-Validation for Model Selection." BMC Bioinformatics 7: 91. doi:10.1186/1471-2105-7-91.

> On Nov 29, 2016, at 6:12 AM, Joel Nothman <joel.noth...@gmail.com> wrote:
>
> Offer whatever patches you think will help.
>
> On 29 November 2016 at 22:01, Daniel Homola <daniel.homol...@imperial.ac.uk> wrote:
> Sorry, should've done that.
> Thanks for the PR. To me it isn't the actual concept of nested CV that needs more detailed explanation, but its implementation in scikit-learn.
> I think it's not obvious at all for a newcomer (heck, I've been using it for years on and off and even I got confused) that the clf GridSearchCV object will carry its inner CV object into the cross_val_score function, which has its own outer CV object. Unless you know that in scikit-learn the CV object of an estimator is NOT overridden by the cross_val_score function's cv parameter, but rather results in a nested CV, you simply cannot work out why this example works. This is the confusing bit, I think. Do you want me to add comments that highlight this issue?
>
> On 29/11/16 10:48, Joel Nothman wrote:
>> Wait an hour for the docs to build and you won't get "artifact not found" :)
>>
>> If you'd looked at the PR diff, you'd see I've modified the description to refer directly to GridSearchCV and cross_val_score:
>>
>> In the inner loop (here executed by GridSearchCV), the score is approximately maximized by fitting a model to each training set, and then directly maximized in selecting (hyper)parameters over the validation set. In the outer loop (here in cross_val_score), ...
>>
>> Further comments in the code are welcome.
>>
>> On 29 November 2016 at 21:42, Albert Thomas <albertthoma...@gmail.com> wrote:
>> I also get "artifact not found". And I agree with Daniel.
>>
>> Once you decompose what the code is doing, you realize that it does the job. The simplicity of the code for performing nested cross-validation with scikit-learn objects is impressive, but I guess it also makes it less obvious. So making the example clearer, by explaining what the code does or by adding a few comments, could be useful for others.
>>
>> Albert
>>
>> On Tue, 29 Nov 2016 at 11:19, Daniel Homola <daniel.homol...@imperial.ac.uk> wrote:
>> Hi Joel,
>>
>> Thanks a lot for the answer.
>> "Each train/test split in cross_val_score holds out test data. GridSearchCV then splits each train set into (inner-)train and validation sets."
>>
>> I know this is what nested CV is supposed to do, but the code does an excellent job of obscuring this. I'll try to add some clarification as comments later today.
>>
>> Cheers,
>>
>> d
>>
>> On 29/11/16 00:07, Joel Nothman wrote:
>>> If that clarifies, please offer changes to the example (as a pull request) that make this clearer.
>>>
>>> On 29 November 2016 at 11:06, Joel Nothman <joel.noth...@gmail.com> wrote:
>>> Briefly:
>>>
>>> clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=inner_cv)
>>> nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv)
>>>
>>> Each train/test split in cross_val_score holds out test data. GridSearchCV then splits each train set into (inner-)train and validation sets. There is no leakage of test-set knowledge from the outer loop into the grid search optimisation, and no leakage of validation-set knowledge into the SVR optimisation. The outer test data are reused as training data, but within each split are only used to measure generalisation error.
>>>
>>> Is that clear?
>>>
>>> On 29 November 2016 at 10:30, Daniel Homola <dani.hom...@gmail.com> wrote:
>>> Dear all,
>>>
>>> I was wondering if the following example code is valid:
>>> http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html
>>>
>>> My understanding is that the point of nested cross-validation is to prevent any data leakage from the inner grid-search/param-optimization CV loop into the outer model-evaluation CV loop. This could be achieved if the outer CV loop's test data were completely separated from the inner loop's CV, as shown here:
>>> https://mlr-org.github.io/mlr-tutorial/release/html/img/nested_resampling.png
>>>
>>> The code in the above example, however, doesn't seem to achieve this in any way.
>>>
>>> Am I missing something here?
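For reference, the two-line idiom quoted above runs as a complete script along these lines (a minimal sketch; the variable names follow the scikit-learn example, and the specific parameter grid and splitter settings are just illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVR

X_iris, y_iris = load_iris(return_X_y=True)
svr = SVR(kernel="rbf")
p_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1]}

# Two independent splitters: inner_cv drives the grid search,
# outer_cv drives cross_val_score. The cv passed to cross_val_score
# does NOT replace the estimator's own cv -- it wraps around it,
# which is exactly what makes this nested CV.
inner_cv = KFold(n_splits=4, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=1)

clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=inner_cv)
nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv)
print(nested_score.mean())
```

Each of the 4 outer folds triggers a full grid search on its training portion, so the outer test folds never influence the hyperparameter selection.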
>>>
>>> Thanks a lot,
>>> dh
>>>
>>> _______________________________________________
>>> scikit-learn mailing list
>>> scikit-learn@python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn