Hi Ludovico,

my bet is that there is an issue with the format of the object that you pass to the `cv` param of GridSearchCV. What you need is, e.g., "An iterable yielding train, test splits."
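To make that concrete, here is a minimal, untested sketch of such an iterable (the toy dataset, the index arrays, and the parameter grid are made up purely for illustration, not taken from your setup):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# toy stand-in for a 68 x 24 data matrix
X, y = make_classification(n_samples=68, n_features=24, random_state=42)

# a plain list of (train_indices, test_indices) tuples -- exactly the kind
# of object that the `cv` parameter of GridSearchCV accepts
custom_cv = [
    (np.arange(0, 48), np.arange(48, 68)),   # round 1: first 48 train, last 20 test
    (np.arange(20, 68), np.arange(0, 20)),   # round 2: last 48 train, first 20 test
]

gs = GridSearchCV(SVC(kernel='linear'),
                  param_grid={'C': [0.1, 1.0, 10.0]},
                  cv=custom_cv,              # iterable of (train, test) index arrays
                  scoring='roc_auc')
gs.fit(X, y)
print(gs.best_params_)

Each element of custom_cv is consumed as one round of the internal cross-validation, and the indices refer to rows of the X that you pass to fit.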
Or more specifically, say you have a generator, my_gen, that is yielding these splits. The indices would be organized like this:

list(my_gen)[0][0]  # stores an array of indices used as the training fold in the 1st round
                    # e.g., sth like np.array([0, 1, 2, 3, 4, 5, 6, …])
list(my_gen)[0][1]  # stores an array of indices used as the test fold in the 1st round
                    # e.g., sth like np.array([102, 103, 104, 105, 106, 107, 108, …])
list(my_gen)[1][0]  # stores an array of indices used as the training fold in the 2nd round
list(my_gen)[1][1]  # stores an array of indices used as the test fold in the 2nd round
list(my_gen)[2][0]  # stores an array of indices used as the training fold in the 3rd round
list(my_gen)[2][1]  # stores an array of indices used as the test fold in the 3rd round

Hope that helps.

Best,
Sebastian

> The following did not work. This is what we get --> ValueError: too many
> values to unpack

> On Feb 27, 2017, at 5:13 PM, Ludovico Coletta <ludo25...@hotmail.com> wrote:
>
> Dear Sebastian,
>
> thank you for the quick answer.
>
> The data is stored in a numpy array (shape: 68, 24). We are using scikit-learn 0.18.1.
>
> I noticed that I wrote something wrong in my previous email. Your solution is indeed
> correct if we let scikit-learn decide how to manage the inner loop. This is what
> we did at the beginning. By doing so, we noticed that the classifier's
> performance decreases (in comparison to a non-optimized classifier). We would
> like to control the inner split, and we need to store the metrics for each fold.
>
> The way we obtained the indices for the optimization, train, and test phases is
> the equivalent of something like this:
>
> rs = ShuffleSplit(n_splits=9, test_size=.25, random_state=42)
> indices_for_each_cv = list(rs.split(data[0:11]))
>
> Maybe I can make myself clearer if I write what we would like to achieve for
> the first cross-validation fold (I acknowledge that the previous email was
> quite a mess, sorry). Outer loop: 48 for training, 20 for testing. Of the 48
> training subjects, we would like to use 42 for optimization and 6 for testing
> the parameters. We chose the indices so that they match the different scanners
> even in the optimization phase, but we are not able to pass them to the
> GridSearchCV object.
>
> The following did not work. This is what we get --> ValueError: too many
> values to unpack
>
> ii = 0
>
> while ii < len(cv_final):
>     # fit and predict
>     clf = GridSearchCV(
>         pipeline,
>         param_grid=param_grid,
>         verbose=1,
>         cv=cv_final_nested[ii],  # how to split the 48 train subjects for the optimization
>         scoring='roc_auc',
>         n_jobs=-1)
>
>     clf.fit(data[cv_final[ii][0]], y[cv_final[ii][0]])  # the train data of the outer loop
>                                                         # for the first fold (i.e. the 48 subjects)
>     predictions.append(clf.predict(data[cv_final[ii][1]]))  # predict the 20 subjects
>                                                             # left out for test in the outer loop
>     ii = ii + 1
>
> This, however, works and should be (more or less) what we would like to achieve
> with the above loop. However, extracting the best parameters for each fold in
> order to predict the left-out data seems impossible or very laborious.
>
> clf = GridSearchCV(
>     pipeline,
>     param_grid=param_grid,
>     verbose=1,
>     cv=cv_final_nested,
>     scoring='roc_auc',
>     n_jobs=-1)
>
> clf.fit(data, y)
>
> Any hint on how to solve this problem would be really appreciated.
>
> Best,
> Ludovico
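PS: judging from the loop you posted, cv=cv_final_nested[ii] hands GridSearchCV a single (train, test) tuple, so it ends up iterating over the two index arrays themselves, which would produce exactly the "too many values to unpack" error; wrapping that split in a list should already help. Also note that the indices you pass via cv refer to positions in whatever X you give to fit, so after clf.fit(data[cv_final[ii][0]], ...) the inner indices have to be relative to those 48 training subjects, which would explain the "index out of bound" you saw with the other variant. Below is a rough, untested sketch along those lines; the names and the index remapping are only illustrative, and it assumes cv_final_nested[ii] is one (train, valid) tuple expressed in terms of the full 68-subject array (if your inner indices are already relative to the outer training subset, drop the remapping):

import numpy as np
from sklearn.model_selection import GridSearchCV

# assumed to exist as in your snippets: data, y, pipeline, param_grid,
# cv_final (list of 9 outer (train_idx, test_idx) tuples) and
# cv_final_nested (one inner (train_idx, valid_idx) tuple per outer fold)

predictions = []
best_params_per_fold = []

for ii, (outer_train_idx, outer_test_idx) in enumerate(cv_final):
    inner_train_idx, inner_valid_idx = cv_final_nested[ii]

    # GridSearchCV is fit on data[outer_train_idx], so the inner split must be
    # expressed as positions *within* that 48-subject training subset
    pos = {orig: new for new, orig in enumerate(outer_train_idx)}
    inner_cv = [(np.array([pos[i] for i in inner_train_idx]),
                 np.array([pos[i] for i in inner_valid_idx]))]   # note: a LIST of splits

    clf = GridSearchCV(pipeline,
                       param_grid=param_grid,
                       verbose=1,
                       cv=inner_cv,          # iterable of (train, valid) index arrays
                       scoring='roc_auc',
                       n_jobs=-1)
    clf.fit(data[outer_train_idx], y[outer_train_idx])

    best_params_per_fold.append(clf.best_params_)          # per-fold best parameters
    predictions.append(clf.predict(data[outer_test_idx]))  # predict the 20 held-out subjects

That way you keep one GridSearchCV per outer fold, so clf.best_params_ gives you the selected parameters for that fold directly.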
> From: scikit-learn <scikit-learn-bounces+ludo25_90=hotmail....@python.org> on behalf of
> scikit-learn-requ...@python.org <scikit-learn-requ...@python.org>
> Sent: Monday, 27 February 2017, 17:27
> To: scikit-learn@python.org
> Subject: scikit-learn Digest, Vol 11, Issue 29
>
> Today's Topics:
>
>    1. GSoC 2017 (Gael Varoquaux)
>    2. Control over the inner loop in GridSearchCV (Ludovico Coletta)
>    3. Re: Control over the inner loop in GridSearchCV (Sebastian Raschka)
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Mon, 27 Feb 2017 11:58:35 +0100
> From: Gael Varoquaux <gael.varoqu...@normalesup.org>
> To: Scikit-learn user and developer mailing list <scikit-learn@python.org>
> Subject: [scikit-learn] GSoC 2017
>
> Hi,
>
> Students have been inquiring about the GSoC (Google Summer of Code) with
> scikit-learn, and the core team has been quite silent about it.
>
> I am happy to announce that we will be taking part in the GSoC with
> scikit-learn again. The reason we decided to do this is to give a chance
> to young, talented, and motivated students.
>
> Importantly, our most limiting resource is the time of our experienced
> developers. This is clearly visible from the number of pending pull
> requests. Hence, we need students to be very able and independent. This
> of course means that they will be getting supervision from mentors. Such
> supervision is crucial for moving forward with a good project that
> delivers mergeable code. However, we will need the students to be very
> good at interacting efficiently with the mentors. Also, I should stress
> that we will be able to take only a very small number of students.
>
> With that said, let me introduce the 2017 GSoC for scikit-learn. We have
> set up a wiki page which summarizes the experiences from last year and
> the ideas for this year:
> https://github.com/scikit-learn/scikit-learn/wiki/Google-summer-of-code-(GSOC)-2017
>
> Interested students should declare their interest on the mailing list,
> and discuss with possible mentors here. Factors of success will be
>
> * careful work on a good proposal, that takes one of the ideas on the wiki
> but breaks it down into a realistic plan with multiple steps and shows a
> good understanding of the problem.
>
> * demonstration of the required skillset via successful pull requests in
> scikit-learn.
> Cheers,
> Gaël
>
> --
> Gael Varoquaux
> Researcher, INRIA Parietal
> NeuroSpin/CEA Saclay, Bat 145, 91191 Gif-sur-Yvette, France
> Phone: ++ 33-1-69-08-79-68
> http://gael-varoquaux.info  http://twitter.com/GaelVaroquaux
>
> ------------------------------
>
> Message: 2
> Date: Mon, 27 Feb 2017 14:27:59 +0000
> From: Ludovico Coletta <ludo25...@hotmail.com>
> To: "scikit-learn@python.org" <scikit-learn@python.org>
> Subject: [scikit-learn] Control over the inner loop in GridSearchCV
>
> Dear Scikit experts,
>
> we are stuck with GridSearchCV. Nobody else was able or willing to help us; we
> hope you will.
>
> We are analysing neuroimaging data coming from 3 different MRI scanners,
> where for each scanner we have a healthy group and a disease group. We would
> like to merge the data from the 3 different scanners in order to classify the
> healthy subjects from the ones who have the disease.
>
> The problem is that we can almost perfectly classify the subjects according
> to the scanner (e.g. the healthy subjects from scanner 1 and scanner 2). We
> are using a custom cross-validation schema to account for the different
> scanners: when no hyperparameter (SVM) optimization is performed, everything
> is straightforward. Problems arise when we would like to perform
> hyperparameter optimization: in this case we need to balance for the
> different scanners in the optimization phase as well. We also found a custom
> cv schema for this, but we are not able to pass it to the GridSearchCV object.
> We would like to get something like the following:
>
> pipeline = Pipeline([('scl', StandardScaler()),
>                      ('sel', RFE(estimator, step=0.2)),
>                      ('clf', SVC(probability=True, random_state=42))])
>
> param_grid = [{'sel__n_features_to_select': [22, 15, 10, 2],
>                'clf__C': np.logspace(-3, 5, 100),
>                'clf__kernel': ['linear']}]
>
> clf = GridSearchCV(pipeline,
>                    param_grid=param_grid,
>                    verbose=1,
>                    scoring='roc_auc',
>                    n_jobs=-1)
>
> # cv_final is the custom cv for the outer loop (9 folds)
>
> ii = 0
>
> while ii < len(cv_final):
>     # fit and predict
>     clf.fit(data[?], y[?])
>     predictions.append(clf.predict(data[cv_final[ii][1]]))  # outer test data
>     ii = ii + 1
>
> We tried almost everything. When we define clf in the loop, pass the i-th
> cv_nested as the cv argument, and fit it on the training data of the i-th
> custom_cv fold, we get a "Too many values to unpack" error. On the other
> hand, when we try to pass the nested i-th cv fold as the cv argument for clf
> and call fit on the same cv_nested fold, we get an "Index out of bound" error.
>
> Two questions:
>
> 1) Is there any workaround to avoid the split when clf is called without a cv
> argument?
>
> 2) We suppose that for hyperparameter optimization the test data is removed
> from the dataset and a new dataset is created. Is this true?
> In this case we only have to adjust the indices accordingly.
>
> Thank you for your time, and sorry for the long text.
>
> Ludovico
>
> ------------------------------
>
> Message: 3
> Date: Mon, 27 Feb 2017 11:27:24 -0500
> From: Sebastian Raschka <se.rasc...@gmail.com>
> To: Scikit-learn user and developer mailing list <scikit-learn@python.org>
> Subject: Re: [scikit-learn] Control over the inner loop in GridSearchCV
>
> Hi, Ludovico,
>
> what format (shape) is data in? Are these the arrays from a KFold iterator?
> In this case, the "question marks" in your code snippet should simply be the
> train and validation subset indices generated by the KFold generator, e.g.,
>
> skfold = StratifiedKFold(y=y_train, n_folds=5, shuffle=True, random_state=1)
>
> for outer_train_idx, outer_valid_idx in skfold:
>     ...
>     gridsearch_object.fit(X_train[outer_train_idx], y_train[outer_train_idx])
>
> > On the other hand, when we try to pass the nested i-th cv fold as the cv
> > argument for clf, and we call fit on the same cv_nested fold, we get an
> > "Index out of bound" error.
> > Two questions:
>
> Are you using a version older than scikit-learn 0.18? Technically,
> GridSearchCV, RandomizedSearchCV, cross_val_score, etc. should all support
> iterables of train_ and test_ indices, e.g.:
>
> outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
>
> for name, gs_est in sorted(gridcvs.items()):
>     nested_score = cross_val_score(gs_est,
>                                    X=X_train,
>                                    y=y_train,
>                                    cv=outer_cv,
>                                    n_jobs=1)
>
> Best,
> Sebastian
> ------------------------------
>
> End of scikit-learn Digest, Vol 11, Issue 29
> ********************************************

_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn