Offer whatever patches you think will help.

On 29 November 2016 at 22:01, Daniel Homola <daniel.homol...@imperial.ac.uk> wrote:
> Sorry, should've done that.
>
> Thanks for the PR. To me it isn't the actual concept of nested CV that
> needs more detailed explanation but the implementation in scikit-learn.
>
> I think it's not obvious at all for a newcomer (heck, I've been using it
> for years on and off and even I got confused) that the clf GridSearchCV
> object will carry its inner CV object into the cross_val_score function,
> which has its own outer CV object. Unless you know that in scikit-learn
> the CV object of an estimator is *NOT* overridden by the
> cross_val_score function's cv parameter, but rather the two combine into a
> nested CV, you simply cannot work out why this example works. This is the
> confusing bit, I think. Do you want me to add comments that highlight this
> issue?
>
> On 29/11/16 10:48, Joel Nothman wrote:
>
>> Wait an hour for the docs to build and you won't get "artifact not found" :)
>>
>> If you'd looked at the PR diff, you'd see I've modified the description to
>> refer directly to GridSearchCV and cross_val_score:
>>
>>> In the inner loop (here executed by GridSearchCV), the score is
>>> approximately maximized by fitting a model to each training set, and then
>>> directly maximized in selecting (hyper)parameters over the validation set.
>>> In the outer loop (here in cross_val_score), ...
>>
>> Further comments in the code are welcome.
>>
>> On 29 November 2016 at 21:42, Albert Thomas <albertthoma...@gmail.com> wrote:
>>
>> I also get "artifact not found". And I agree with Daniel.
>>
>> Once you decompose what the code is doing, you realize that it does the
>> job. The simplicity of the code to perform nested cross-validation using
>> scikit-learn objects is impressive, but I guess it also makes it less
>> obvious. So making the example clearer by explaining what the code does,
>> or by adding a few comments, can be useful for others.
>>
>> Albert
>>
>> On Tue, 29 Nov 2016 at 11:19, Daniel Homola <daniel.homol...@imperial.ac.uk> wrote:
>>
>>> Hi Joel,
>>>
>>> Thanks a lot for the answer.
>>>
>>> "Each train/test split in cross_val_score holds out test data.
>>> GridSearchCV then splits each train set into (inner-)train and validation
>>> sets."
>>>
>>> I know this is what nested CV is supposed to do, but the code is doing an
>>> excellent job at obscuring this. I'll try and add some clarification in as
>>> comments later today.
>>>
>>> Cheers,
>>>
>>> d
>>>
>>> On 29/11/16 00:07, Joel Nothman wrote:
>>>
>>> If that clarifies, please offer changes to the example (as a pull
>>> request) that make this clearer.
>>>
>>> On 29 November 2016 at 11:06, Joel Nothman <joel.noth...@gmail.com> wrote:
>>>
>>> Briefly:
>>>
>>>     clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=inner_cv)
>>>     nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv)
>>>
>>> Each train/test split in cross_val_score holds out test data.
>>> GridSearchCV then splits each train set into (inner-)train and validation
>>> sets. There is no leakage of test set knowledge from the outer loop into
>>> the grid search optimisation; no leakage of validation set knowledge into
>>> the SVR optimisation. The outer test data are reused as training data, but
>>> within each split are only used to measure generalisation error.
>>>
>>> Is that clear?
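[A self-contained sketch of the two-line pattern Joel describes, for readers following along. The estimator, grid, and splitter choices below (SVC on iris, these particular C/gamma values, 4-fold KFold) are my assumptions for illustration, not necessarily what the scikit-learn example ships with; only the GridSearchCV-inside-cross_val_score structure is the point.]

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X_iris, y_iris = load_iris(return_X_y=True)

# Hypothetical grid/estimator for illustration.
svr = SVC(kernel="rbf")
p_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1]}

inner_cv = KFold(n_splits=4, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=1)

# Inner loop: GridSearchCV picks hyperparameters on validation folds
# carved out of whatever training data it is given.
clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=inner_cv)

# Outer loop: cross_val_score holds out a test fold per split.
# Note: cv=outer_cv does NOT override clf's inner_cv; the two nest.
nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv)
print(nested_score.mean())
```

Each of the four outer test folds is scored by a GridSearchCV that never saw it during fitting, which is exactly the no-leakage property under discussion.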
>>>
>>> On 29 November 2016 at 10:30, Daniel Homola <dani.hom...@gmail.com> wrote:
>>>
>>> Dear all,
>>>
>>> I was wondering if the following example code is valid:
>>>
>>> http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html
>>>
>>> My understanding is that the point of nested cross-validation is to
>>> prevent any data leakage from the inner grid-search/param optimization CV
>>> loop into the outer model evaluation CV loop. This could be achieved if the
>>> outer CV loop's test data were completely separated from the inner loop's
>>> CV, as shown here:
>>>
>>> https://mlr-org.github.io/mlr-tutorial/release/html/img/nested_resampling.png
>>>
>>> The code in the above example, however, doesn't seem to achieve this in
>>> any way.
>>>
>>> Am I missing something here?
>>>
>>> Thanks a lot,
>>>
>>> dh
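[Since the thread's complaint is that the one-liner obscures the nesting, here is an unrolled sketch of what GridSearchCV-inside-cross_val_score effectively does, matching the mlr nested-resampling picture. The estimator, grid, and splitters are my illustrative assumptions; the loop structure is the point.]

```python
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hypothetical estimator and grid for illustration.
svr = SVC(kernel="rbf")
p_grid = {"C": [1, 10], "gamma": [0.01, 0.1]}
inner_cv = KFold(n_splits=4, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=1)

scores = []
for train_idx, test_idx in outer_cv.split(X, y):
    # Outer loop: this test fold is held out entirely.
    X_train, y_train = X[train_idx], y[train_idx]

    # Inner loop: the grid search re-splits ONLY the outer training
    # data into (inner-)train and validation folds.
    gs = GridSearchCV(clone(svr), p_grid, cv=inner_cv)
    gs.fit(X_train, y_train)

    # The outer test fold is touched exactly once, for final scoring,
    # so no knowledge of it leaks into hyperparameter selection.
    scores.append(gs.score(X[test_idx], y[test_idx]))
```

Unrolled this way, the separation the mlr diagram shows is visible: hyperparameters are chosen per outer split using only that split's training portion.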
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn