Hi Ludovico,

my bet is that there is an issue with the format of the object that you pass to the `cv` param of GridSearchCV. What you need is, e.g., "An iterable yielding train, test splits."
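To make that concrete, here is a minimal, untested sketch of such an iterable (the toy dataset, the index arrays, and the parameter grid are made up purely for illustration, not taken from your setup):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# toy stand-in for a 68 x 24 data matrix
X, y = make_classification(n_samples=68, n_features=24, random_state=42)

# a plain list of (train_indices, test_indices) tuples -- exactly the kind
# of object that the `cv` parameter of GridSearchCV accepts
custom_cv = [
    (np.arange(0, 48), np.arange(48, 68)),   # round 1: first 48 train, last 20 test
    (np.arange(20, 68), np.arange(0, 20)),   # round 2: last 48 train, first 20 test
]

gs = GridSearchCV(SVC(kernel='linear'),
                  param_grid={'C': [0.1, 1.0, 10.0]},
                  cv=custom_cv,              # iterable of (train, test) index arrays
                  scoring='roc_auc')
gs.fit(X, y)
print(gs.best_params_)

Each element of custom_cv is consumed as one round of the internal cross-validation, and the indices refer to rows of the X that you pass to fit.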
Or more specifically, say you have a generator, my_gen, that is yielding these splits. The indices would be organized like this:

list(my_gen)[0][0]  # stores an array of indices used as the training fold in the 1st round
                    # e.g., sth like np.array([0, 1, 2, 3, 4, 5, 6, …])
list(my_gen)[0][1]  # stores an array of indices used as the test fold in the 1st round
                    # e.g., sth like np.array([102, 103, 104, 105, 106, 107, 108, …])
list(my_gen)[1][0]  # stores an array of indices used as the training fold in the 2nd round
list(my_gen)[1][1]  # stores an array of indices used as the test fold in the 2nd round
list(my_gen)[2][0]  # stores an array of indices used as the training fold in the 3rd round
list(my_gen)[2][1]  # stores an array of indices used as the test fold in the 3rd round

Hope that helps.

Best,
Sebastian

> The following did not work. This is what we get --> ValueError: too many
> values to unpack

> On Feb 27, 2017, at 5:13 PM, Ludovico Coletta <ludo25...@hotmail.com> wrote:
>
> Dear Sebastian,
>
> thank you for the quick answer.
>
> The data is stored in a numpy array (shape: 68, 24). We are using scikit-learn 0.18.1.
>
> I noticed that I wrote something wrong in my previous email. Your solution is indeed
> correct if we let scikit-learn decide how to manage the inner loop. This is what
> we did at the beginning. By doing so, we noticed that the classifier's
> performance decreases (in comparison to a non-optimized classifier). We would
> like to control the inner split, and we need to store the metrics for each fold.
>
> The way we obtained the indices for the optimization, train, and test phases is
> the equivalent of something like this:
>
> rs = ShuffleSplit(n_splits=9, test_size=.25, random_state=42)
> indices_for_each_cv = list(rs.split(data[0:11]))
>
> Maybe I can make myself clearer if I write what we would like to achieve for
> the first cross-validation fold (I acknowledge that the previous email was
> quite a mess, sorry). Outer loop: 48 for training, 20 for testing. Of the 48
> training subjects, we would like to use 42 for optimization and 6 for testing
> the parameters. We chose the indices so that they match the different scanners
> even in the optimization phase, but we are not able to pass them to the
> GridSearchCV object.
>
> The following did not work. This is what we get --> ValueError: too many
> values to unpack
>
> ii = 0
>
> while ii < len(cv_final):
>     # fit and predict
>     clf = GridSearchCV(
>         pipeline,
>         param_grid=param_grid,
>         verbose=1,
>         cv=cv_final_nested[ii],  # how to split the 48 train subjects for the optimization
>         scoring='roc_auc',
>         n_jobs=-1)
>
>     clf.fit(data[cv_final[ii][0]], y[cv_final[ii][0]])  # the train data of the outer loop
>                                                         # for the first fold (i.e. the 48 subjects)
>     predictions.append(clf.predict(data[cv_final[ii][1]]))  # predict the 20 subjects
>                                                             # left out for test in the outer loop
>     ii = ii + 1
>
> This, however, works and should be (more or less) what we would like to achieve
> with the above loop. However, extracting the best parameters for each fold in
> order to predict the left-out data seems impossible or very laborious.
>
> clf = GridSearchCV(
>     pipeline,
>     param_grid=param_grid,
>     verbose=1,
>     cv=cv_final_nested,
>     scoring='roc_auc',
>     n_jobs=-1)
>
> clf.fit(data, y)
>
> Any hint on how to solve this problem would be really appreciated.
>
> Best,
> Ludovico
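PS: judging from the loop you posted, cv=cv_final_nested[ii] hands GridSearchCV a single (train, test) tuple, so it ends up iterating over the two index arrays themselves, which would produce exactly the "too many values to unpack" error; wrapping that split in a list should already help. Also note that the indices you pass via cv refer to positions in whatever X you give to fit, so after clf.fit(data[cv_final[ii][0]], ...) the inner indices have to be relative to those 48 training subjects, which would explain the "index out of bound" you saw with the other variant. Below is a rough, untested sketch along those lines; the names and the index remapping are only illustrative, and it assumes cv_final_nested[ii] is one (train, valid) tuple expressed in terms of the full 68-subject array (if your inner indices are already relative to the outer training subset, drop the remapping):

import numpy as np
from sklearn.model_selection import GridSearchCV

# assumed to exist as in your snippets: data, y, pipeline, param_grid,
# cv_final (list of 9 outer (train_idx, test_idx) tuples) and
# cv_final_nested (one inner (train_idx, valid_idx) tuple per outer fold)

predictions = []
best_params_per_fold = []

for ii, (outer_train_idx, outer_test_idx) in enumerate(cv_final):
    inner_train_idx, inner_valid_idx = cv_final_nested[ii]

    # GridSearchCV is fit on data[outer_train_idx], so the inner split must be
    # expressed as positions *within* that 48-subject training subset
    pos = {orig: new for new, orig in enumerate(outer_train_idx)}
    inner_cv = [(np.array([pos[i] for i in inner_train_idx]),
                 np.array([pos[i] for i in inner_valid_idx]))]   # note: a LIST of splits

    clf = GridSearchCV(pipeline,
                       param_grid=param_grid,
                       verbose=1,
                       cv=inner_cv,          # iterable of (train, valid) index arrays
                       scoring='roc_auc',
                       n_jobs=-1)
    clf.fit(data[outer_train_idx], y[outer_train_idx])

    best_params_per_fold.append(clf.best_params_)          # per-fold best parameters
    predictions.append(clf.predict(data[outer_test_idx]))  # predict the 20 held-out subjects

That way you keep one GridSearchCV per outer fold, so clf.best_params_ gives you the selected parameters for that fold directly.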
> From: scikit-learn <scikit-learn-bounces+ludo25_90=hotmail....@python.org> on behalf of
> scikit-learn-requ...@python.org <scikit-learn-requ...@python.org>
> Sent: Monday, 27 February 2017, 17:27
> To: scikit-learn@python.org
> Subject: scikit-learn Digest, Vol 11, Issue 29
>
> Today's Topics:
>
>    1. GSoC 2017 (Gael Varoquaux)
>    2. Control over the inner loop in GridSearchCV (Ludovico Coletta)
>    3. Re: Control over the inner loop in GridSearchCV (Sebastian Raschka)
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Mon, 27 Feb 2017 11:58:35 +0100
> From: Gael Varoquaux <gael.varoqu...@normalesup.org>
> To: Scikit-learn user and developer mailing list <scikit-learn@python.org>
> Subject: [scikit-learn] GSoC 2017
>
> Hi,
>
> Students have been inquiring about the GSoC (Google Summer of Code) with
> scikit-learn, and the core team has been quite silent about it.
>
> I am happy to announce that we will be taking part in the GSoC with
> scikit-learn again. The reason we decided to do this is to give a chance
> to young, talented, and motivated students.
>
> Importantly, our most limiting resource is the time of our experienced
> developers. This is clearly visible from the number of pending pull
> requests. Hence, we need students to be very able and independent. This
> of course means that they will be getting supervision from mentors. Such
> supervision is crucial for moving forward with a good project that
> delivers mergeable code. However, we will need the students to be very
> good at interacting efficiently with the mentors. Also, I should stress
> that we will be able to take only a very small number of students.
>
> With that said, let me introduce the 2017 GSoC for scikit-learn. We have
> set up a wiki page which summarizes the experiences from last year and
> the ideas for this year:
> https://github.com/scikit-learn/scikit-learn/wiki/Google-summer-of-code-(GSOC)-2017
>
> Interested students should declare their interest on the mailing list,
> and discuss with possible mentors here. Factors of success will be
>
> * careful work on a good proposal, that takes one of the ideas on the wiki
> but breaks it down into a realistic plan with multiple steps and shows a
> good understanding of the problem.
>
> * demonstration of the required skillset via successful pull requests in
> scikit-learn.
> Cheers,
> Gaël
>
> --
> Gael Varoquaux
> Researcher, INRIA Parietal
> NeuroSpin/CEA Saclay, Bat 145, 91191 Gif-sur-Yvette, France
> Phone: ++ 33-1-69-08-79-68
> http://gael-varoquaux.info  http://twitter.com/GaelVaroquaux
>
> ------------------------------
>
> Message: 2
> Date: Mon, 27 Feb 2017 14:27:59 +0000
> From: Ludovico Coletta <ludo25...@hotmail.com>
> To: "scikit-learn@python.org" <scikit-learn@python.org>
> Subject: [scikit-learn] Control over the inner loop in GridSearchCV
>
> Dear Scikit experts,
>
> we are stuck with GridSearchCV. Nobody else was able or willing to help us; we
> hope you will.
>
> We are analysing neuroimaging data coming from 3 different MRI scanners,
> where for each scanner we have a healthy group and a disease group. We would
> like to merge the data from the 3 different scanners in order to classify the
> healthy subjects from the ones who have the disease.
>
> The problem is that we can almost perfectly classify the subjects according
> to the scanner (e.g. the healthy subjects from scanner 1 and scanner 2). We
> are using a custom cross-validation schema to account for the different
> scanners: when no hyperparameter (SVM) optimization is performed, everything
> is straightforward. Problems arise when we would like to perform
> hyperparameter optimization: in this case we need to balance for the
> different scanners in the optimization phase as well. We also found a custom
> cv schema for this, but we are not able to pass it to the GridSearchCV object.
> We would like to get something like the following:
>
> pipeline = Pipeline([('scl', StandardScaler()),
>                      ('sel', RFE(estimator, step=0.2)),
>                      ('clf', SVC(probability=True, random_state=42))])
>
> param_grid = [{'sel__n_features_to_select': [22, 15, 10, 2],
>                'clf__C': np.logspace(-3, 5, 100),
>                'clf__kernel': ['linear']}]
>
> clf = GridSearchCV(pipeline,
>                    param_grid=param_grid,
>                    verbose=1,
>                    scoring='roc_auc',
>                    n_jobs=-1)
>
> # cv_final is the custom cv for the outer loop (9 folds)
>
> ii = 0
>
> while ii < len(cv_final):
>     # fit and predict
>     clf.fit(data[?], y[?])
>     predictions.append(clf.predict(data[cv_final[ii][1]]))  # outer test data
>     ii = ii + 1
>
> We tried almost everything. When we define clf in the loop, pass the i-th
> cv_nested as the cv argument, and fit it on the training data of the i-th
> custom_cv fold, we get a "Too many values to unpack" error. On the other
> hand, when we try to pass the nested i-th cv fold as the cv argument for clf
> and call fit on the same cv_nested fold, we get an "Index out of bound" error.
>
> Two questions:
>
> 1) Is there any workaround to avoid the split when clf is called without a cv
> argument?
>
> 2) We suppose that for hyperparameter optimization the test data is removed
> from the dataset and a new dataset is created. Is this true?
> In this case we only have to adjust the indices accordingly.
>
> Thank you for your time, and sorry for the long text.
>
> Ludovico
>
> ------------------------------
>
> Message: 3
> Date: Mon, 27 Feb 2017 11:27:24 -0500
> From: Sebastian Raschka <se.rasc...@gmail.com>
> To: Scikit-learn user and developer mailing list <scikit-learn@python.org>
> Subject: Re: [scikit-learn] Control over the inner loop in GridSearchCV
>
> Hi, Ludovico,
>
> what format (shape) is data in? Are these the arrays from a KFold iterator?
> In this case, the "question marks" in your code snippet should simply be the
> train and validation subset indices generated by the KFold generator, e.g.,
>
> skfold = StratifiedKFold(y=y_train, n_folds=5, shuffle=True, random_state=1)
>
> for outer_train_idx, outer_valid_idx in skfold:
>     ...
>     gridsearch_object.fit(X_train[outer_train_idx], y_train[outer_train_idx])
>
> > On the other hand, when we try to pass the nested i-th cv fold as the cv
> > argument for clf, and we call fit on the same cv_nested fold, we get an
> > "Index out of bound" error.
> > Two questions:
>
> Are you using a version older than scikit-learn 0.18? Technically,
> GridSearchCV, RandomizedSearchCV, cross_val_score, etc. should all support
> iterables of train_ and test_ indices, e.g.:
>
> outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
>
> for name, gs_est in sorted(gridcvs.items()):
>     nested_score = cross_val_score(gs_est,
>                                    X=X_train,
>                                    y=y_train,
>                                    cv=outer_cv,
>                                    n_jobs=1)
>
> Best,
> Sebastian
> ------------------------------
>
> End of scikit-learn Digest, Vol 11, Issue 29
> ********************************************

_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn