Sorry, should've done that.
Thanks for the PR. To me it isn't the actual concept of nested CV that
needs more detailed explanation but its implementation in scikit-learn.
I think it's not obvious at all for a newcomer (heck, I've been using it
for years on and off and even I got confused) that the clf GridSearchCV
object carries its inner CV object into the cross_val_score
function, which has its own outer CV object. Unless you know that in
scikit-learn the CV object of an estimator is *NOT* overridden by the
cross_val_score function's cv parameter, but rather combines with it
into a nested CV, you simply cannot work out why this example works. This is
the confusing bit, I think. Do you want me to add comments that
highlight this issue?
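Something like the following, maybe: a minimal sketch of the example with the
kind of comments I mean. The names (clf, p_grid, inner_cv, outer_cv) follow the
example; using SVC and the iris data here is just my assumption to keep the
snippet self-contained and runnable.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
p_grid = {"C": [1, 10], "gamma": [0.01, 0.1]}

inner_cv = KFold(n_splits=4, shuffle=True, random_state=1)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=1)

# The inner CV lives inside the estimator: GridSearchCV uses it to pick
# hyperparameters on whatever training data it is handed via fit().
clf = GridSearchCV(estimator=SVC(), param_grid=p_grid, cv=inner_cv)

# cross_val_score's cv does NOT replace clf's cv; it wraps it. Each outer
# training set is passed to clf.fit, which runs the inner CV within it, so
# the outer test folds never reach the hyperparameter search: nested CV.
nested_score = cross_val_score(clf, X=X, y=y, cv=outer_cv)
print(nested_score.mean())
```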
On 29/11/16 10:48, Joel Nothman wrote:
Wait an hour for the docs to build and you won't get "artifact not
found" :)
If you'd looked at the PR diff, you'd see I've modified the
description to refer directly to GridSearchCV and cross_val_score:
In the inner loop (here executed by |GridSearchCV|), the score is
approximately maximized by fitting a model to each training set,
and then directly maximized in selecting (hyper)parameters over
the validation set. In the outer loop (here in |cross_val_score|), ...
Further comments in the code are welcome.
On 29 November 2016 at 21:42, Albert Thomas <albertthoma...@gmail.com> wrote:
I also get "artifact not found". And I agree with Daniel.
Once you decompose what the code is doing you realize that it does
the job. The simplicity of the code for performing nested cross-validation
using scikit-learn objects is impressive, but I guess it
also makes it less obvious. So making the example clearer by
explaining what the code does, or by adding a few comments, could be
useful for others.
Albert
On Tue, 29 Nov 2016 at 11:19, Daniel Homola
<daniel.homol...@imperial.ac.uk> wrote:
Hi Joel,
Thanks a lot for the answer.
"Each train/test split in cross_val_score holds out test data.
GridSearchCV then splits each train set into (inner-)train and
validation sets. "
I know this is what nested CV is supposed to do, but the code does
an excellent job of obscuring this. I'll try and add
some clarification as comments later today.
Cheers,
d
On 29/11/16 00:07, Joel Nothman wrote:
If that clarifies, please offer changes to the example (as a
pull request) that make this clearer.
On 29 November 2016 at 11:06, Joel Nothman
<joel.noth...@gmail.com> wrote:
Briefly:
clf = GridSearchCV(estimator=svr, param_grid=p_grid, cv=inner_cv)
nested_score = cross_val_score(clf, X=X_iris, y=y_iris, cv=outer_cv)
Each train/test split in cross_val_score holds out test
data. GridSearchCV then splits each train set into
(inner-)train and validation sets. There is no leakage of
test set knowledge from the outer loop into the grid
search optimisation; no leakage of validation set
knowledge into the SVR optimisation. The outer test data
are reused as training data, but within each split are
only used to measure generalisation error.
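That disjointness can be checked mechanically. The sketch below mimics with
plain KFold loops what GridSearchCV does internally with the outer training
set; the data shape and fold counts here are arbitrary choices of mine, not
taken from the example.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)  # 10 toy samples
outer_cv = KFold(n_splits=5)
inner_cv = KFold(n_splits=3)

checks = 0
for outer_train, outer_test in outer_cv.split(X):
    # The grid search is only ever fitted on the outer training portion,
    # so the inner splitter re-splits X[outer_train], never the full data.
    for inner_train, val in inner_cv.split(X[outer_train]):
        # Inner indices are positions *within* outer_train; mapping them
        # back shows neither inner-train nor validation touches outer_test.
        assert set(outer_train[inner_train]).isdisjoint(outer_test)
        assert set(outer_train[val]).isdisjoint(outer_test)
        checks += 1
print(checks)  # 5 outer folds x 3 inner folds = 15 disjointness checks
```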
Is that clear?
On 29 November 2016 at 10:30, Daniel Homola
<dani.hom...@gmail.com> wrote:
Dear all,
I was wondering if the following example code is valid:
http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html
My understanding is that the point of nested
cross-validation is to prevent any data leakage from
the inner grid-search/parameter-optimization CV loop into
the outer model-evaluation CV loop. This would be
achieved if the outer CV loop's test data were
completely separated from the inner loop's CV, as
shown here:
https://mlr-org.github.io/mlr-tutorial/release/html/img/nested_resampling.png
The code in the above example however doesn't seem to
achieve this in any way.
Am I missing something here?
Thanks a lot,
dh
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn