Re: [scikit-learn] Control over the inner loop in GridSearchCV

Sebastian Raschka Mon, 27 Feb 2017 08:29:35 -0800

Hi, Ludovico,
what format (shape) is the data in? Are these the arrays from a KFold iterator? In 
that case, the “question marks” in your code snippet should simply be the train 
and validation subset indices generated by the KFold generator. E.g.,


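# Pre-0.18 API (sklearn.cross_validation). From 0.18 on, use
# StratifiedKFold(n_splits=5, ...) and iterate over skfold.split(X_train, y_train).
skfold = StratifiedKFold(y=y_train, n_folds=5, shuffle=True, random_state=1)
for outer_train_idx, outer_valid_idx in skfold:
    …
    gridsearch_object.fit(X_train[outer_train_idx], y_train[outer_train_idx])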
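
To make the index bookkeeping concrete, here is a minimal sketch of such an outer loop with the 0.18+ API. The synthetic data, the `scanner` array, and the `strata` trick (stratifying on class and scanner jointly) are my assumptions for illustration, not your actual cv scheme. The point it tries to show is that explicit inner folds passed via `cv` have to index into the array handed to `fit`, i.e. the outer training subset, not the full data set. A mismatch there would be consistent with an "Index out of bound" error.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the real data (assumption): 90 samples, 3 scanners.
rng = np.random.RandomState(0)
y = np.tile([0, 1], 45)                 # class label per sample
scanner = np.repeat([0, 1, 2], 30)      # hypothetical scanner label per sample
X = rng.normal(size=(90, 10)) + y[:, None]

# Stratify on class and scanner jointly so each fold mixes all scanners.
strata = y * 3 + scanner

pipe = Pipeline([("scl", StandardScaler()), ("clf", SVC(kernel="linear"))])
param_grid = {"clf__C": [0.1, 1.0, 10.0]}

outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=2)

predictions = np.empty_like(y)
for outer_train_idx, outer_test_idx in outer_cv.split(X, strata):
    X_tr, y_tr = X[outer_train_idx], y[outer_train_idx]
    # Inner folds are computed on the outer training subset, so their
    # indices run from 0 to len(X_tr) - 1, not over the full data set.
    inner_folds = list(inner_cv.split(X_tr, strata[outer_train_idx]))
    gs = GridSearchCV(pipe, param_grid, cv=inner_folds, scoring="roc_auc")
    gs.fit(X_tr, y_tr)
    # The outer test samples were never seen during the grid search.
    predictions[outer_test_idx] = gs.predict(X[outer_test_idx])
```

If you already have custom inner folds expressed as indices into the full data set, they would need to be remapped to positions within `outer_train_idx` before being passed as `cv`.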

> 
> On the other hand, when we pass the nested i-th cv fold as the cv argument 
> for clf and call fit on the same cv_nested fold, we get an "Index out of 
> bound" error.
> Two questions: 

Are you using a version older than scikit-learn 0.18? Technically, 
GridSearchCV, RandomizedSearchCV, cross_val_score … should all support 
iterables of train and test indices, e.g.:

outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

for name, gs_est in sorted(gridcvs.items()):
    nested_score = cross_val_score(gs_est,
                                   X=X_train,
                                   y=y_train,
                                   cv=outer_cv,
                                   n_jobs=1)
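
For completeness, a self-contained sketch of the same idea, assuming each value in `gridcvs` above is a GridSearchCV object. The data here is synthetic and the parameter grid is only a placeholder. Because the inner `cv` is a splitter object rather than a fixed list of indices, it is re-applied to whatever training subset the outer loop passes in, so no index adjustment is needed.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic two-class data (assumption, for illustration only).
rng = np.random.RandomState(0)
y = np.tile([0, 1], 50)
X = rng.normal(size=(100, 5)) + y[:, None]

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=2)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: hyperparameter search on each outer training subset.
gs_est = GridSearchCV(SVC(kernel="linear"),
                      param_grid={"C": [0.1, 1.0, 10.0]},
                      cv=inner_cv,
                      scoring="roc_auc")

# Outer loop: one score per outer fold, each from a freshly tuned model.
nested_scores = cross_val_score(gs_est, X=X, y=y, cv=outer_cv,
                                scoring="roc_auc", n_jobs=1)
print("nested ROC AUC: %.3f +/- %.3f" % (nested_scores.mean(), nested_scores.std()))
```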


Best,
Sebastian

> On Feb 27, 2017, at 9:27 AM, Ludovico Coletta <[email protected]> wrote:
> 
> Dear Scikit experts,
> 
> we are stuck with GridSearchCV. Nobody else was able or willing to help us; we 
> hope you will. 
> 
> We are analysing neuroimaging data coming from 3 different MRI scanners, 
> where for each scanner we have a healthy group and a disease group. We would 
> like to merge the data from the 3 different scanners in order to separate the 
> healthy subjects from those who have the disease. 
> 
> The problem is that we can almost perfectly classify the subjects according 
> to the scanner (e.g. the healthy subjects from scanner 1 and scanner 2). We 
> are using a custom cross-validation scheme to account for the different 
> scanners: when no hyperparameter (SVM) optimization is performed, everything 
> is straightforward. Problems arise when we want to perform 
> hyperparameter optimization: in this case we need to balance for the 
> different scanners in the optimization phase as well. We also found a custom 
> cv scheme for this, but we are not able to pass it to the GridSearchCV object. 
> We would like to get something like the following:
> 
> pipeline = Pipeline([('scl', StandardScaler()),
>                      ('sel', RFE(estimator, step=0.2)),
>                      ('clf', SVC(probability=True, random_state=42))])
> 
> param_grid = [{'sel__n_features_to_select': [22, 15, 10, 2],
>                'clf__C': np.logspace(-3, 5, 100),
>                'clf__kernel': ['linear']}]
> 
> clf = GridSearchCV(pipeline,
>                    param_grid=param_grid,
>                    verbose=1,
>                    scoring='roc_auc',
>                    n_jobs=-1)
> 
> # cv_final is the custom cv for the outer loop (9 folds)
> 
> ii = 0
> 
> while ii < len(cv_final):
>     # fit and predict
>     clf.fit(data[?], y[?])
>     predictions.append(clf.predict(data[cv_final[ii][1]]))  # outer test data
>     ii = ii + 1
> 
> We tried almost everything. When we define clf in the loop, pass the i-th 
> cv_nested as the cv argument, and fit it on the training data of the i-th 
> custom_cv fold, we get a "Too many values to unpack" error. On the other 
> hand, when we pass the nested i-th cv fold as the cv argument for clf and 
> call fit on the same cv_nested fold, we get an "Index out of bound" error. 
>  
> Two questions: 
> 1) Is there any workaround to avoid the split when clf is called without a cv 
> argument? 
> 2) We suppose that for hyperparameter optimization the test data is removed 
> from the dataset and a new dataset is created. Is this true? In this case we 
> only have to adjust the indices accordingly.
> 
> Thank you for your time, and sorry for the long text.
> Ludovico
> _______________________________________________
> scikit-learn mailing list
> [email protected]
> https://mail.python.org/mailman/listinfo/scikit-learn

