Cool, glad to hear that it was such an easy fix :)

> On Jan 30, 2017, at 3:49 PM, Raga Markely <raga.mark...@gmail.com> wrote:
>
> Nice catch!! The sklearn version was 0.18, but I used sklearn.grid_search instead of sklearn.model_selection.
>
> The error is gone now.
>
> Thank you, Sebastian!
> Raga
>
> On Mon, Jan 30, 2017 at 3:37 PM, Sebastian Raschka <se.rasc...@gmail.com> wrote:
> Hm, which version of scikit-learn are you using? Are you running this on sklearn 0.18?
>
> Best,
> Sebastian
>
> > On Jan 30, 2017, at 2:48 PM, Raga Markely <raga.mark...@gmail.com> wrote:
> >
> > Hi Sebastian,
> >
> > Following up on the original question on repeated Grid Search CV, I tried to do a repeated nested loop using the following:
> >
> > N_outer = 10
> > N_inner = 10
> > scores = []
> > for i in range(N_outer):
> >     k_fold_outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=i)
> >     for j in range(N_inner):
> >         k_fold_inner = StratifiedKFold(n_splits=10, shuffle=True, random_state=j)
> >         gs = GridSearchCV(estimator=pipe_svc, param_grid=param_grid, cv=k_fold_inner)
> >         score = cross_val_score(estimator=gs, X=X, y=y, cv=k_fold_outer)
> >         scores.append(score)
> > np.mean(scores)
> > np.std(scores)
> >
> > But I get the following error: TypeError: 'StratifiedKFold' object is not iterable
> >
> > I did some trials, and the error is gone when I remove cv=k_fold_inner from gs = ...
> > Could you give me some tips on what I can do?
> >
> > Thank you!
> > Raga
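A minimal runnable sketch of the repeated nested CV above with the imports that resolve the TypeError: on sklearn 0.18, GridSearchCV and cross_val_score have to come from sklearn.model_selection (the older sklearn.grid_search / sklearn.cross_validation modules do not accept the new CV splitter objects and raise "'StratifiedKFold' object is not iterable"). The iris data, pipeline, and parameter grid are placeholders standing in for the poster's X, y, pipe_svc, and param_grid.

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC
    # These must come from model_selection (sklearn >= 0.18); the deprecated
    # sklearn.grid_search module cannot handle the new splitter objects.
    from sklearn.model_selection import StratifiedKFold, GridSearchCV, cross_val_score

    X, y = load_iris(return_X_y=True)                  # stand-in data so the sketch runs
    pipe_svc = make_pipeline(StandardScaler(), SVC())  # placeholder for the poster's pipeline
    param_grid = {'svc__C': [0.1, 1.0, 10.0]}          # placeholder grid

    N_outer = 10   # 10 x 10 repetitions of nested CV; slow, shrink for a quick test
    N_inner = 10
    scores = []
    for i in range(N_outer):
        k_fold_outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=i)
        for j in range(N_inner):
            k_fold_inner = StratifiedKFold(n_splits=10, shuffle=True, random_state=j)
            gs = GridSearchCV(estimator=pipe_svc, param_grid=param_grid, cv=k_fold_inner)
            score = cross_val_score(estimator=gs, X=X, y=y, cv=k_fold_outer)
            scores.append(score)

    print(np.mean(scores), np.std(scores))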
> > On Fri, Jan 27, 2017 at 1:16 PM, Raga Markely <raga.mark...@gmail.com> wrote:
> > Hi Sebastian,
> >
> > Sorry, I used the wrong terms (I was referring to the algo as the model).. great then, I think what I have is aligned with your workflow..
> >
> > Thank you very much for your help!
> >
> > Have a good weekend,
> > Raga
> >
> > On Fri, Jan 27, 2017 at 1:01 PM, Sebastian Raschka <se.rasc...@gmail.com> wrote:
> > Hi, Raga,
> >
> > Sounds good, but I am wondering a bit about the order. 2) should come before 1), right? Because model selection is basically done via hyperparam optimization.
> >
> > Not saying that this is the optimal/right approach, but I usually do it like this:
> >
> > 1.) algo selection via nested cv
> > 2.) model selection based on best algo via k-fold on whole training set
> > 3.) fit best algo w. best hyperparams (from 2.) to whole training set
> > 4.) evaluate on test set
> > 5.) fit classifier to whole dataset, done
> >
> > Best,
> > Sebastian
> >
> > >> On Jan 27, 2017, at 10:23 AM, Raga Markely <raga.mark...@gmail.com> wrote:
> > >>
> > >> Sounds good, Sebastian.. thanks for the suggestions..
> > >>
> > >> My dataset is relatively small (only ~35 samples), and this is the workflow I have set up so far..
> > >>
> > >> 1. Model selection: use a nested loop with cross_val_score(GridSearchCV(...), ...), same as shown on the scikit-learn page you provided - the results show no statistically significant difference in mean accuracy +/- SD among the classifiers.. this is expected, as the pattern is pretty obvious and simple to separate by eye after dimensionality reduction (I use a pipeline of stdscaler, LDA, and classifier)... so I take all of them and use a voting classifier in step #3..
> > >> 2. Hyperparameter optimization: use GridSearchCV to optimize the hyperparameters of each classifier
> > >> 3. Decision region: use the hyperparameters from step #2, fit each classifier separately to the whole dataset, and use a voting classifier to get the decision region
> > >>
> > >> Does this sound reasonable?
> > >>
> > >> Thank you very much!
> > >> Raga
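A minimal sketch of step #3 of the workflow above: wrap each classifier in the same scaler + LDA pipeline, plug in its tuned hyperparameters, and fit a voting ensemble to the whole dataset for the decision-region plot. The iris data, the choice of classifiers, the hard-voting setting, and the hyperparameter values are illustrative assumptions, not the poster's actual setup; in practice the values would come from each GridSearchCV's best_params_ in step #2.

    from sklearn.datasets import load_iris
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.linear_model import LogisticRegression
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC
    from sklearn.ensemble import VotingClassifier

    X, y = load_iris(return_X_y=True)          # stand-in for the ~35-sample dataset

    def lda_pipe(clf):
        # same preprocessing as described in the thread: standardize, LDA, classifier
        return make_pipeline(StandardScaler(), LinearDiscriminantAnalysis(), clf)

    # hyperparameter values below are placeholders for GridSearchCV.best_params_
    voting = VotingClassifier(
        estimators=[('lr', lda_pipe(LogisticRegression(C=1.0))),
                    ('knn', lda_pipe(KNeighborsClassifier(n_neighbors=5))),
                    ('svc', lda_pipe(SVC(kernel='rbf', C=1.0, gamma=0.1)))],
        voting='hard')

    voting.fit(X, y)                           # fit on the whole dataset
    print(voting.predict(X[:5]))               # the fitted ensemble is what gets plotted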
> > >> On Thu, Jan 26, 2017 at 8:31 PM, Sebastian Raschka <se.rasc...@gmail.com> wrote:
> > >> You are welcome! And in addition, if you select among different algorithms, here are some more suggestions:
> > >>
> > >> a) don't do it based on your independent test set if this is going to be your final model performance estimate, or be aware that it would be overly optimistic
> > >> b) also, it's not the best idea to select algorithms using cross-validation on the same training set that you used for model selection; a more robust way would be nested CV (e.g., http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html)
> > >>
> > >> But yeah, it all depends on your dataset and size. If you have a neural net that takes weeks to train, and if you have a large dataset anyway so that you can set aside large sets for testing, I'd train on train/validation splits and evaluate on the test set. And to compare, e.g., two networks against each other on large test sets, you could do a McNemar test.
> > >>
> > >> Best,
> > >> Sebastian
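The McNemar test mentioned above is not part of scikit-learn, so here is a minimal hand-rolled sketch of it (the continuity-corrected chi-squared version), assuming you already have the true labels and the predictions of the two models on the same test set; the variable names in the usage comment are hypothetical.

    import numpy as np
    from scipy.stats import chi2

    def mcnemar_test(y_true, y_pred_a, y_pred_b):
        # Compare two classifiers on the same test set using only the
        # discordant cases (one model right, the other wrong).
        a_correct = (y_pred_a == y_true)
        b_correct = (y_pred_b == y_true)
        b = int(np.sum(a_correct & ~b_correct))    # A right, B wrong
        c = int(np.sum(~a_correct & b_correct))    # A wrong, B right
        if b + c == 0:
            return 0.0, 1.0                        # the models never disagree
        stat = (abs(b - c) - 1.0) ** 2 / (b + c)   # continuity-corrected statistic
        return stat, chi2.sf(stat, df=1)           # p-value from the chi2(1) tail

    # Hypothetical usage with two fitted models and a held-out test set:
    # stat, p = mcnemar_test(y_test, model_a.predict(X_test), model_b.predict(X_test))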
> > >>> On Jan 26, 2017, at 8:09 PM, Raga Markely <raga.mark...@gmail.com> wrote:
> > >>>
> > >>> Ahh.. nice.. I will use that.. thanks a lot, Sebastian!
> > >>>
> > >>> Best,
> > >>> Raga
> > >>>
> > >>> On Thu, Jan 26, 2017 at 6:34 PM, Sebastian Raschka <se.rasc...@gmail.com> wrote:
> > >>> Hi, Raga,
> > >>>
> > >>> I think that if GridSearchCV is used for classification, the stratified k-fold doesn't do shuffling by default.
> > >>>
> > >>> Say you do 20 grid search repetitions; you could then do something like:
> > >>>
> > >>> from sklearn.model_selection import StratifiedKFold
> > >>>
> > >>> for i in range(n_reps):
> > >>>     k_fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=i)
> > >>>     gs = GridSearchCV(..., cv=k_fold)
> > >>>     ...
> > >>>
> > >>> Best,
> > >>> Sebastian
> > >>>
> > >>>> On Jan 26, 2017, at 5:39 PM, Raga Markely <raga.mark...@gmail.com> wrote:
> > >>>>
> > >>>> Hello,
> > >>>>
> > >>>> I was trying to do repeated Grid Search CV (20 repeats). I thought that each time I call GridSearchCV, the training and test sets would be split differently.
> > >>>>
> > >>>> However, I got the same best_params_ and best_scores_ for all 20 repeats. It looks like the training and test sets are separated into identical folds in each run. Just to clarify, e.g. I have the following data: 0,1,2,3,4. Class 1 = [0,1,2] and Class 2 = [3,4]. Suppose I call cv = 2. The split is always, for instance, [0,3] [1,2,4] in each repeat, and I couldn't get [1,3] [0,2,4] or other combinations.
> > >>>>
> > >>>> If I understand correctly, GridSearchCV uses StratifiedKFold when I enter cv = integer. The StratifiedKFold command has a random state; I wonder if there is any way I can make the training and test sets be split randomly each time I call GridSearchCV?
> > >>>>
> > >>>> Just a note, I used the following classifiers: Logistic Regression, KNN, SVC, Kernel SVC, Random Forest, and had the same observation regardless of the classifier.
> > >>>>
> > >>>> Thank you very much!
> > >>>> Raga
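To make the symptom in the original question concrete, a small self-contained demonstration (toy data, made up for illustration) of why cv=<integer> gives identical folds on every repetition for classification, and how shuffle=True with a different random_state per repetition, as in the snippet above, changes that:

    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    X = np.arange(20).reshape(10, 2)     # 10 toy samples, 2 features
    y = np.array([0] * 6 + [1] * 4)      # two classes

    print("no shuffling (what cv=2 uses internally for classification):")
    for rep in range(3):
        cv = StratifiedKFold(n_splits=2, shuffle=False)
        print([list(test) for _, test in cv.split(X, y)])   # identical every repeat

    print("shuffle=True with a different random_state per repeat:")
    for rep in range(3):
        cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=rep)
        print([list(test) for _, test in cv.split(X, y)])   # folds now vary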