You are welcome! And in addition, if you are selecting among different 
algorithms, here are a few more suggestions:

a) don’t do it based on your independent test set if that set is going to provide 
your final model performance estimate, or be aware that the estimate would then 
be overly optimistic
b) also, it’s not the best idea to select algorithms using cross-validation on 
the same training set that you already used for hyperparameter tuning; a more 
robust way would be nested CV (e.g., 
http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html); 
see the sketch below
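
For illustration, here is a minimal nested CV sketch; the iris data, the SVC, 
and the parameter grid are just placeholders, so swap in your own estimator 
and grid:

# inner loop: GridSearchCV tunes the hyperparameters
# outer loop: cross_val_score estimates the performance of the whole tuning
# procedure, so the reported score isn't biased by the search itself
from sklearn.datasets import load_iris
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     cross_val_score)
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2)

param_grid = {'C': [0.1, 1.0, 10.0], 'gamma': [0.01, 0.1, 1.0]}
gs = GridSearchCV(SVC(), param_grid, cv=inner_cv)

nested_scores = cross_val_score(gs, X, y, cv=outer_cv)
print(nested_scores.mean(), nested_scores.std())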

But yeah, it all depends on your dataset and its size. If you have a neural net 
that takes weeks to train, and if you have a large dataset anyway so that you 
can set aside a large set for testing, I’d train on train/validation splits and 
evaluate on the test set. And to compare, e.g., two networks against each other 
on a large test set, you could run a McNemar test (rough sketch below).
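
A rough McNemar sketch, assuming preds_a and preds_b are the two models’ 
predictions on the same test set (NumPy arrays) and that there is at least one 
discordant pair:

import numpy as np
from scipy.stats import chi2

def mcnemar_test(y_true, preds_a, preds_b):
    # McNemar's test on the discordant pairs (cases where exactly one of the
    # two models is correct), chi-squared approximation with continuity
    # correction, df = 1
    a_correct = preds_a == y_true
    b_correct = preds_b == y_true
    n01 = np.sum(a_correct & ~b_correct)   # model A right, model B wrong
    n10 = np.sum(~a_correct & b_correct)   # model A wrong, model B right
    stat = (abs(n01 - n10) - 1.0) ** 2 / (n01 + n10)
    p_value = chi2.sf(stat, df=1)
    return stat, p_value

# e.g. (hypothetical names): stat, p = mcnemar_test(y_test, net1_preds, net2_preds)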

Best,
Sebastian

> On Jan 26, 2017, at 8:09 PM, Raga Markely <raga.mark...@gmail.com> wrote:
> 
> Ahh.. nice.. I will use that.. thanks a lot, Sebastian!
> 
> Best,
> Raga
> 
> On Thu, Jan 26, 2017 at 6:34 PM, Sebastian Raschka <se.rasc...@gmail.com> 
> wrote:
> Hi, Raga,
> 
> I think that if GridSearchCV is used for classification, the stratified 
> k-fold it uses internally doesn’t shuffle by default.
> 
> Say you do 20 grid search repetitions; you could then do something like:
> 
> 
> from sklearn.model_selection import GridSearchCV, StratifiedKFold
> 
> for i in range(n_reps):
>     # a different random_state per repetition gives differently shuffled folds
>     k_fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=i)
>     gs = GridSearchCV(..., cv=k_fold)
>     ...
> 
> Best,
> Sebastian
> 
> > On Jan 26, 2017, at 5:39 PM, Raga Markely <raga.mark...@gmail.com> wrote:
> >
> > Hello,
> >
> > I was trying to do repeated Grid Search CV (20 repeats). I thought that 
> > each time I call GridSearchCV, the data would be split into different 
> > training and test folds.
> >
> > However, I got the same best_params_ and best_score_ for all 20 repeats. 
> > It looks like the training and test sets are separated into identical folds 
> > in each run? Just to clarify, say I have the following data: 0,1,2,3,4, with 
> > Class 1 = [0,1,2] and Class 2 = [3,4], and I call cv = 2. The split is 
> > always, for instance, [0,3] [1,2,4] in each repeat, and I couldn't get [1,3] 
> > [0,2,4] or other combinations.
> >
> > If I understand correctly, GridSearchCV uses StratifiedKFold when I enter 
> > cv = integer. StratifiedKFold has a random_state parameter; I wonder if 
> > there is any way I can make the training and test sets be separated 
> > randomly each time I call GridSearchCV?
> >
> > Just a note, I used the following classifiers: Logistic Regression, KNN, 
> > SVC, Kernel SVC, and Random Forest, and had the same observation regardless 
> > of the classifier.
> >
> > Thank you very much!
> > Raga
> >

_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
