I have an ipynb where I did the nested CV more "manually" in sklearn 0.17 vs. sklearn 0.18. I intended to add it as an appendix to a blog article (model eval part 4), which I haven't had a chance to write yet. Maybe the sklearn 0.17 part is a bit more obvious (although far less elegant) than the sklearn 0.18 version, and is helpful in some way for seeing what's going on: https://github.com/rasbt/pattern_classification/blob/master/data_viz/model-evaluation-articles/nested_cv_code.ipynb (I haven't had a chance to add comments yet, though).
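For anyone who can't open the notebook: the "manual" version essentially spells out the two loops explicitly. A minimal sketch of the idea (not the exact notebook code; it uses the current model_selection API, and the SVC/parameter grid here are just illustrative):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
p_grid = {"C": [1, 10, 100]}

outer_cv = KFold(n_splits=4, shuffle=True, random_state=0)
outer_scores = []
for train_idx, test_idx in outer_cv.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # Inner loop: hyperparameter search sees only the outer training fold
    inner_cv = KFold(n_splits=3, shuffle=True, random_state=1)
    search = GridSearchCV(SVC(kernel="rbf"), p_grid, cv=inner_cv)
    search.fit(X_train, y_train)
    # Outer loop: the held-out fold is used only to estimate
    # generalisation error of the tuned model
    outer_scores.append(search.score(X_test, y_test))

print(np.mean(outer_scores))
```

Making the two loops explicit like this shows where the data separation happens: the test fold of the outer split is never touched by the grid search.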
Btw. does anyone have a good (research article) reference for nested CV? I see people often referring to Dietterich [1], who mentions 5x2 CV. However, I think his 5x2 CV approach is different from the "nested cross-validation" that is commonly used, since the 5x2 example is just 2-fold CV repeated 5 times (10 estimates). Maybe Varma & Simon [2] would be a better reference? However, they seem to hold out only 1 test sample in the outer fold. Does anyone know of a nice empirical study on nested CV (sth. like Ron Kohavi's for k-fold CV)?

[1] Dietterich, Thomas G. 1998. "Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms." Neural Computation 10 (7): 1895–1923. doi:10.1162/089976698300017197.
[2] Varma, Sudhir, and Richard Simon. 2006. "Bias in Error Estimation When Using Cross-Validation for Model Selection." BMC Bioinformatics 7: 91. doi:10.1186/1471-2105-7-91.

> On Nov 29, 2016, at 6:12 AM, Joel Nothman <joel.noth...@gmail.com> wrote:
>
> Offer whatever patches you think will help.
>
> On 29 November 2016 at 22:01, Daniel Homola <daniel.homol...@imperial.ac.uk> wrote:
> Sorry, should've done that.
> Thanks for the PR. To me it isn't the actual concept of nested CV that needs more detailed explanation, but its implementation in scikit-learn.
> I think it's not obvious at all for a newcomer (heck, I've been using it for years on and off and even I got confused) that the clf GridSearchCV object will carry its inner CV object into the cross_val_score function, which has its own outer CV object. Unless you know that in scikit-learn the CV object of an estimator is NOT overridden by the cross_val_score function's cv parameter, but rather results in a nested CV, you simply cannot work out why this example works. This is the confusing bit, I think. Do you want me to add comments that highlight this issue?
>
> On 29/11/16 10:48, Joel Nothman wrote:
>> Wait an hour for the docs to build and you won't get "artifact not found" :)
>>
>> If you'd looked at the PR diff, you'd see I've modified the description to refer directly to GridSearchCV and cross_val_score:
>>
>> In the inner loop (here executed by GridSearchCV), the score is approximately maximized by fitting a model to each training set, and then directly maximized in selecting (hyper)parameters over the validation set. In the outer loop (here in cross_val_score), ...
>>
>> Further comments in the code are welcome.
>>
>> On 29 November 2016 at 21:42, Albert Thomas <albertthoma...@gmail.com> wrote:
>> I also get "artifact not found". And I agree with Daniel.
>>
>> Once you decompose what the code is doing, you realize that it does the job. The simplicity of the code for performing nested cross-validation with scikit-learn objects is impressive, but I guess it also makes it less obvious. So making the example clearer, by explaining what the code does or by adding a few comments, could be useful for others.
>>
>> Albert
>>
>> On Tue, 29 Nov 2016 at 11:19, Daniel Homola <daniel.homol...@imperial.ac.uk> wrote:
>> Hi Joel,
>>
>> Thanks a lot for the answer.
>> "Each train/test split in cross_val_score holds out test data. GridSearchCV then splits each train set into (inner-)train and validation sets."
>>
>> I know this is what nested CV is supposed to do, but the code does an excellent job of obscuring this. I'll try to add some clarification as comments later today.
>>
>> Cheers,
>>
>> d
>>
>> On 29/11/16 00:07, Joel Nothman wrote:
>>> If that clarifies, please offer changes to the example (as a pull request) that make this clearer.
>>>
>>> On 29 November 2016 at 11:06, Joel Nothman <joel.noth...@gmail.com> wrote:
>>> Briefly:
>>>
>>> clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=inner_cv)
>>> nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv)
>>>
>>> Each train/test split in cross_val_score holds out test data. GridSearchCV then splits each train set into (inner-)train and validation sets. There is no leakage of test-set knowledge from the outer loop into the grid search optimisation, and no leakage of validation-set knowledge into the SVR optimisation. The outer test data are reused as training data, but within each split are only used to measure generalisation error.
>>>
>>> Is that clear?
>>>
>>> On 29 November 2016 at 10:30, Daniel Homola <dani.hom...@gmail.com> wrote:
>>> Dear all,
>>>
>>> I was wondering if the following example code is valid:
>>> http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html
>>>
>>> My understanding is that the point of nested cross-validation is to prevent any data leakage from the inner grid-search/param-optimization CV loop into the outer model-evaluation CV loop. This could be achieved if the outer CV loop's test data were completely separated from the inner loop's CV, as shown here:
>>> https://mlr-org.github.io/mlr-tutorial/release/html/img/nested_resampling.png
>>>
>>> The code in the above example, however, doesn't seem to achieve this in any way.
>>>
>>> Am I missing something here?
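For reference, the two-line idiom quoted above runs as a complete script along these lines (a minimal sketch; the variable names follow the scikit-learn example, and the specific parameter grid and splitter settings are just illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVR

X_iris, y_iris = load_iris(return_X_y=True)
svr = SVR(kernel="rbf")
p_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1]}

# Two independent splitters: inner_cv drives the grid search,
# outer_cv drives cross_val_score. The cv passed to cross_val_score
# does NOT replace the estimator's own cv -- it wraps around it,
# which is exactly what makes this nested CV.
inner_cv = KFold(n_splits=4, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=1)

clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=inner_cv)
nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv)
print(nested_score.mean())
```

Each of the 4 outer folds triggers a full grid search on its training portion, so the outer test folds never influence the hyperparameter selection.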
>>>
>>> Thanks a lot,
>>> dh
>>>
>>> _______________________________________________
>>> scikit-learn mailing list
>>> scikit-learn@python.org
>>> https://mail.python.org/mailman/listinfo/scikit-learn