Hi Sebastian,

Following up on the original question on repeated Grid Search CV, I tried to do a repeated nested loop using the following:

N_outer = 10
N_inner = 10
scores = []
for i in range(N_outer):
    k_fold_outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=i)
    for j in range(N_inner):
        k_fold_inner = StratifiedKFold(n_splits=10, shuffle=True, random_state=j)
        gs = GridSearchCV(estimator=pipe_svc, param_grid=param_grid, cv=k_fold_inner)
        score = cross_val_score(estimator=gs, X=X, y=y, cv=k_fold_outer)
        scores.append(score)

np.mean(scores)
np.std(scores)
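For completeness, here is a self-contained sketch of the same loop with the imports spelled out. The iris data, pipeline, and parameter grid below are just placeholders standing in for my actual X, y, pipe_svc, and param_grid, and I am assuming everything comes from sklearn.model_selection rather than the older sklearn.grid_search / sklearn.cross_validation modules, in case that matters:

from sklearn.model_selection import StratifiedKFold, GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.datasets import load_iris
import numpy as np

# Placeholder data, pipeline, and grid -- my real X, y, pipe_svc, and param_grid differ.
X, y = load_iris(return_X_y=True)
pipe_svc = make_pipeline(StandardScaler(), SVC())
param_grid = {'svc__C': [0.1, 1.0, 10.0]}

N_outer = 10  # repetitions of the outer CV, each with a different shuffle
N_inner = 10  # repetitions of the inner (grid search) CV, each with a different shuffle
scores = []
for i in range(N_outer):
    k_fold_outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=i)
    for j in range(N_inner):
        k_fold_inner = StratifiedKFold(n_splits=10, shuffle=True, random_state=j)
        # inner CV selects the hyperparameters, outer CV estimates performance
        gs = GridSearchCV(estimator=pipe_svc, param_grid=param_grid, cv=k_fold_inner)
        score = cross_val_score(estimator=gs, X=X, y=y, cv=k_fold_outer)
        scores.append(score)

print(np.mean(scores), np.std(scores))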
But I get the following error:

TypeError: 'StratifiedKFold' object is not iterable

I did some trials, and the error goes away when I remove cv=k_fold_inner from the gs = GridSearchCV(...) line. Could you give me some tips on what I can do?

Thank you!
Raga

On Fri, Jan 27, 2017 at 1:16 PM, Raga Markely <raga.mark...@gmail.com> wrote:

> Hi Sebastian,
>
> Sorry, I used the wrong terms (I was referring to the algo as the model).. great then, I think what I have is aligned with your workflow..
>
> Thank you very much for your help!
>
> Have a good weekend,
> Raga
>
> On Fri, Jan 27, 2017 at 1:01 PM, Sebastian Raschka <se.rasc...@gmail.com> wrote:
>
>> Hi, Raga,
>>
>> sounds good, but I am wondering a bit about the order. 2) should come before 1), right? Because model selection is basically done via hyperparam optimization.
>>
>> Not saying that this is the optimal/right approach, but I usually do it like this:
>>
>> 1.) algo selection via nested cv
>> 2.) model selection based on best algo via k-fold on whole training set
>> 3.) fit best algo w. best hyperparams (from 2.) to whole training set
>> 4.) evaluate on test set
>> 5.) fit classifier to whole dataset, done
>>
>> Best,
>> Sebastian
>>
>> On Jan 27, 2017, at 10:23 AM, Raga Markely <raga.mark...@gmail.com> wrote:
>>
>> > Sounds good, Sebastian.. thanks for the suggestions..
>> >
>> > My dataset is relatively small (only ~35 samples), and this is the workflow I have set up so far..
>> > 1. Model selection: use a nested loop with cross_val_score(GridSearchCV(...), ...), the same as shown on the scikit-learn page you provided - the results show no statistically significant difference in mean accuracy +/- SD among the classifiers.. this is expected, as the pattern is pretty obvious and simple to separate by eye after dimensionality reduction (I use a pipeline of StandardScaler, LDA, and classifier)... so I take all of them and use a voting classifier in step #3..
>> > 2. Hyperparameter optimization: use GridSearchCV to optimize the hyperparameters of each classifier
>> > 3. Decision region: use the hyperparameters from step #2, fit each classifier separately to the whole dataset, and use a voting classifier to get the decision region
>> >
>> > Does this sound reasonable?
>> >
>> > Thank you very much!
>> > Raga
>> >
>> > On Thu, Jan 26, 2017 at 8:31 PM, Sebastian Raschka <se.rasc...@gmail.com> wrote:
>> >> You are welcome!
>> >> And in addition, if you select among different algorithms, here are some more suggestions:
>> >>
>> >> a) don't do it based on your independent test set if this is going to be your final model performance estimate, or be aware that it would be overly optimistic
>> >> b) also, it's not the best idea to select algorithms using cross-validation on the same training set that you used for model selection; a more robust way would be nested CV (e.g., http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html)
>> >>
>> >> But yeah, it all depends on your dataset and its size. If you have a neural net that takes weeks to train, and if you have a large dataset anyway so that you can set aside large sets for testing, I'd train on train/validation splits and evaluate on the test set. And to compare e.g. two networks against each other on large test sets, you could do a McNemar test.
>> >>
>> >> Best,
>> >> Sebastian
>> >>
>> >>> On Jan 26, 2017, at 8:09 PM, Raga Markely <raga.mark...@gmail.com> wrote:
>> >>>
>> >>> Ahh.. nice.. I will use that.. thanks a lot, Sebastian!
>> >>>
>> >>> Best,
>> >>> Raga
>> >>>
>> >>> On Thu, Jan 26, 2017 at 6:34 PM, Sebastian Raschka <se.rasc...@gmail.com> wrote:
>> >>> Hi, Raga,
>> >>>
>> >>> I think that if GridSearchCV is used for classification, the stratified k-fold doesn't do shuffling by default.
>> >>>
>> >>> Say you do 20 grid search repetitions, you could then do something like:
>> >>>
>> >>> from sklearn.model_selection import StratifiedKFold
>> >>>
>> >>> for i in range(n_reps):
>> >>>     k_fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=i)
>> >>>     gs = GridSearchCV(..., cv=k_fold)
>> >>>     ...
>> >>>
>> >>> Best,
>> >>> Sebastian
>> >>>
>> >>>> On Jan 26, 2017, at 5:39 PM, Raga Markely <raga.mark...@gmail.com> wrote:
>> >>>>
>> >>>> Hello,
>> >>>>
>> >>>> I was trying to do repeated Grid Search CV (20 repeats). I thought that each time I call GridSearchCV, the way the data are split into training and test folds would be different.
>> >>>>
>> >>>> However, I got the same best_params_ and best_score_ for all 20 repeats. It looks like the data are split into identical folds in each run? Just to clarify, suppose I have the following data: 0,1,2,3,4, with Class 1 = [0,1,2] and Class 2 = [3,4], and I call cv=2. The split is always, for instance, [0,3] [1,2,4] in each repeat, and I couldn't get [1,3] [0,2,4] or other combinations.
>> >>>>
>> >>>> If I understand correctly, GridSearchCV uses StratifiedKFold when I enter cv = integer. The StratifiedKFold command has a random state; I wonder if there is any way I can make the training and test sets randomly separated each time I call GridSearchCV?
>> >>>>
>> >>>> Just a note, I used the following classifiers: Logistic Regression, KNN, SVC, Kernel SVC, Random Forest, and had the same observation regardless of the classifier.
>> >>>>
>> >>>> Thank you very much!
>> >>>> Raga
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn