Hi Raga,

sounds good, but I am wondering a bit about the order: 2) should come before 1), right? Because model selection is basically done via hyperparameter optimization.

Not saying that this is the optimal/right approach, but I usually do it like this:

1.) algorithm selection via nested CV
2.) model selection (i.e., hyperparameter tuning) for the best algorithm via k-fold CV on the whole training set
3.) fit the best algorithm with the best hyperparameters (from 2.) to the whole training set
4.) evaluate on the test set
5.) fit the classifier to the whole dataset, done
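In scikit-learn code, that could look roughly like the following sketch (untested; the SVC pipeline, parameter grid, and the 80/20 split are just placeholders I'm assuming for illustration):

from sklearn.datasets import load_iris
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     cross_val_score, train_test_split)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# hold out a test set for the final performance estimate (step 4)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

pipe = make_pipeline(StandardScaler(), SVC())
param_grid = {'svc__C': [0.1, 1.0, 10.0], 'svc__gamma': [0.01, 0.1, 1.0]}

# 1.) algorithm selection via nested CV: the inner loop tunes the
#     hyperparameters, the outer loop estimates how well the whole
#     tuning procedure generalizes; repeat this block for each
#     candidate algorithm and compare the outer-loop scores
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)
gs = GridSearchCV(pipe, param_grid, cv=inner_cv)
nested_scores = cross_val_score(gs, X_train, y_train, cv=outer_cv)
print('nested CV accuracy: %.3f +/- %.3f'
      % (nested_scores.mean(), nested_scores.std()))

# 2.) + 3.) tune the chosen algorithm via k-fold CV on the whole training
#     set and refit it with the best hyperparameters (refit=True by default)
gs = gs.fit(X_train, y_train)
best_model = gs.best_estimator_

# 4.) evaluate once on the independent test set
print('test accuracy: %.3f' % best_model.score(X_test, y_test))

# 5.) finally, fit the chosen model to the whole dataset
final_model = best_model.fit(X, y)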
Best,
Sebastian

> On Jan 27, 2017, at 10:23 AM, Raga Markely <raga.mark...@gmail.com> wrote:
>
> Sounds good, Sebastian.. thanks for the suggestions..
>
> My dataset is relatively small (only ~35 samples), and this is the workflow I have set up so far..
>
> 1. Model selection: use a nested loop via cross_val_score(GridSearchCV(...), ...), same as shown in the scikit-learn page that you provided - the results show no statistically significant difference in mean accuracy +/- SD among the classifiers.. this is expected, as the pattern is pretty obvious and simple to separate by eye after dimensionality reduction (I use a pipeline of StandardScaler, LDA, and classifier)... so I take all of them and use a voting classifier in step #3..
> 2. Hyperparameter optimization: use GridSearchCV to optimize the hyperparameters of each classifier
> 3. Decision region: use the hyperparameters from step #2, fit each classifier separately to the whole dataset, and use a voting classifier to get the decision region
>
> This sounds reasonable?
>
> Thank you very much!
> Raga
>
> On Thu, Jan 26, 2017 at 8:31 PM, Sebastian Raschka <se.rasc...@gmail.com> wrote:
> You are welcome! And in addition, if you select among different algorithms, here are some more suggestions:
>
> a) don't do it based on your independent test set if this is going to be your final model performance estimate, or be aware that it would be overly optimistic
> b) also, it's not the best idea to select algorithms using cross-validation on the same training set that you used for model selection; a more robust way would be nested CV (e.g., http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html)
>
> But yeah, it all depends on your dataset and size. If you have a neural net that takes weeks to train, and if you have a large dataset anyway so that you can set aside large sets for testing, I'd train on train/validation splits and evaluate on the test set. And to compare, e.g., two networks against each other on large test sets, you could do a McNemar test.
>
> Best,
> Sebastian
>
> > On Jan 26, 2017, at 8:09 PM, Raga Markely <raga.mark...@gmail.com> wrote:
> >
> > Ahh.. nice.. I will use that.. thanks a lot, Sebastian!
> >
> > Best,
> > Raga
> >
> > On Thu, Jan 26, 2017 at 6:34 PM, Sebastian Raschka <se.rasc...@gmail.com> wrote:
> > Hi, Raga,
> >
> > I think that if GridSearchCV is used for classification, the stratified k-fold doesn't do shuffling by default.
> >
> > Say you do 20 grid search repetitions, you could then do sth like:
> >
> > from sklearn.model_selection import GridSearchCV, StratifiedKFold
> >
> > n_reps = 20  # e.g., 20 grid search repetitions
> > for i in range(n_reps):
> >     k_fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=i)
> >     gs = GridSearchCV(..., cv=k_fold)
> >     ...
> >
> > Best,
> > Sebastian
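(Side note: filling in the placeholders of the snippet quoted above, a self-contained version of that repeated, shuffled grid search could look roughly like the following; the logistic regression estimator, the C grid, and the iris data are just assumptions for illustration.)

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = load_iris(return_X_y=True)

param_grid = {'C': [0.01, 0.1, 1.0, 10.0]}
n_reps = 20

best_params, best_scores = [], []
for i in range(n_reps):
    # a different random_state in each repetition -> differently shuffled folds
    k_fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=i)
    gs = GridSearchCV(LogisticRegression(), param_grid, cv=k_fold)
    gs.fit(X, y)
    best_params.append(gs.best_params_)
    best_scores.append(gs.best_score_)

print(best_params)
print(best_scores)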
> > > On Jan 26, 2017, at 5:39 PM, Raga Markely <raga.mark...@gmail.com> wrote:
> > >
> > > Hello,
> > >
> > > I was trying to do a repeated grid search CV (20 repeats). I thought that each time I call GridSearchCV, the training and test sets separated in the different splits would be different.
> > >
> > > However, I got the same best_params_ and best_score_ for all 20 repeats. It looks like the training and test sets are separated into identical folds in each run? Just to clarify, e.g., I have the following data: 0,1,2,3,4. Class 1 = [0,1,2] and Class 2 = [3,4]. Suppose I call cv = 2. The split is always, for instance, [0,3] [1,2,4] in each repeat, and I couldn't get [1,3] [0,2,4] or other combinations.
> > >
> > > If I understand correctly, GridSearchCV uses StratifiedKFold when I enter cv = integer. The StratifiedKFold command has a random state; I wonder if there is any way I can make the training and test sets randomly separated each time I call GridSearchCV?
> > >
> > > Just a note, I used the following classifiers: Logistic Regression, KNN, SVC, Kernel SVC, Random Forest, and had the same observation regardless of the classifier.
> > >
> > > Thank you very much!
> > > Raga

_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn