Hi Sebastian,

Following up on the original question on repeated Grid Search CV, I tried to do a repeated nested loop using the following:

N_outer = 10
N_inner = 10
scores = []
for i in range(N_outer):
    k_fold_outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=i)
    for j in range(N_inner):
        k_fold_inner = StratifiedKFold(n_splits=10, shuffle=True, random_state=j)
        gs = GridSearchCV(estimator=pipe_svc, param_grid=param_grid, cv=k_fold_inner)
        score = cross_val_score(estimator=gs, X=X, y=y, cv=k_fold_outer)
        scores.append(score)

np.mean(scores)
np.std(scores)
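For completeness, here is a self-contained sketch of the same loop with the imports spelled out. The iris data, pipeline, and parameter grid below are just placeholders standing in for my actual X, y, pipe_svc, and param_grid, and I am assuming everything comes from sklearn.model_selection rather than the older sklearn.grid_search / sklearn.cross_validation modules, in case that matters:

from sklearn.model_selection import StratifiedKFold, GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.datasets import load_iris
import numpy as np

# Placeholder data, pipeline, and grid -- my real X, y, pipe_svc, and param_grid differ.
X, y = load_iris(return_X_y=True)
pipe_svc = make_pipeline(StandardScaler(), SVC())
param_grid = {'svc__C': [0.1, 1.0, 10.0]}

N_outer = 10  # repetitions of the outer CV, each with a different shuffle
N_inner = 10  # repetitions of the inner (grid search) CV, each with a different shuffle
scores = []
for i in range(N_outer):
    k_fold_outer = StratifiedKFold(n_splits=10, shuffle=True, random_state=i)
    for j in range(N_inner):
        k_fold_inner = StratifiedKFold(n_splits=10, shuffle=True, random_state=j)
        # inner CV selects the hyperparameters, outer CV estimates performance
        gs = GridSearchCV(estimator=pipe_svc, param_grid=param_grid, cv=k_fold_inner)
        score = cross_val_score(estimator=gs, X=X, y=y, cv=k_fold_outer)
        scores.append(score)

print(np.mean(scores), np.std(scores))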
But I get the following error:

TypeError: 'StratifiedKFold' object is not iterable

I did some trials, and the error goes away when I remove cv=k_fold_inner from the gs = GridSearchCV(...) line. Could you give me some tips on what I can do?

Thank you!
Raga

On Fri, Jan 27, 2017 at 1:16 PM, Raga Markely <raga.mark...@gmail.com> wrote:

> Hi Sebastian,
>
> Sorry, I used the wrong terms (I was referring to the algo as the model).. great then, I think what I have is aligned with your workflow..
>
> Thank you very much for your help!
>
> Have a good weekend,
> Raga
>
> On Fri, Jan 27, 2017 at 1:01 PM, Sebastian Raschka <se.rasc...@gmail.com> wrote:
>
>> Hi, Raga,
>>
>> sounds good, but I am wondering a bit about the order. 2) should come before 1), right? Because model selection is basically done via hyperparam optimization.
>>
>> Not saying that this is the optimal/right approach, but I usually do it like this:
>>
>> 1.) algo selection via nested cv
>> 2.) model selection based on best algo via k-fold on whole training set
>> 3.) fit best algo w. best hyperparams (from 2.) to whole training set
>> 4.) evaluate on test set
>> 5.) fit classifier to whole dataset, done
>>
>> Best,
>> Sebastian
>>
>> On Jan 27, 2017, at 10:23 AM, Raga Markely <raga.mark...@gmail.com> wrote:
>>
>> > Sounds good, Sebastian.. thanks for the suggestions..
>> >
>> > My dataset is relatively small (only ~35 samples), and this is the workflow I have set up so far..
>> > 1. Model selection: use a nested loop with cross_val_score(GridSearchCV(...), ...), the same as shown on the scikit-learn page you provided - the results show no statistically significant difference in mean accuracy +/- SD among the classifiers.. this is expected, as the pattern is pretty obvious and simple to separate by eye after dimensionality reduction (I use a pipeline of StandardScaler, LDA, and classifier)... so I take all of them and use a voting classifier in step #3..
>> > 2. Hyperparameter optimization: use GridSearchCV to optimize the hyperparameters of each classifier
>> > 3. Decision region: use the hyperparameters from step #2, fit each classifier separately to the whole dataset, and use a voting classifier to get the decision region
>> >
>> > Does this sound reasonable?
>> >
>> > Thank you very much!
>> > Raga
>> >
>> > On Thu, Jan 26, 2017 at 8:31 PM, Sebastian Raschka <se.rasc...@gmail.com> wrote:
>> >> You are welcome!
>> >> And in addition, if you select among different algorithms, here are some more suggestions:
>> >>
>> >> a) don't do it based on your independent test set if this is going to be your final model performance estimate, or be aware that it would be overly optimistic
>> >> b) also, it's not the best idea to select algorithms using cross-validation on the same training set that you used for model selection; a more robust way would be nested CV (e.g., http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html)
>> >>
>> >> But yeah, it all depends on your dataset and its size. If you have a neural net that takes weeks to train, and if you have a large dataset anyway so that you can set aside large sets for testing, I'd train on train/validation splits and evaluate on the test set. And to compare e.g. two networks against each other on large test sets, you could do a McNemar test.
>> >>
>> >> Best,
>> >> Sebastian
>> >>
>> >>> On Jan 26, 2017, at 8:09 PM, Raga Markely <raga.mark...@gmail.com> wrote:
>> >>>
>> >>> Ahh.. nice.. I will use that.. thanks a lot, Sebastian!
>> >>>
>> >>> Best,
>> >>> Raga
>> >>>
>> >>> On Thu, Jan 26, 2017 at 6:34 PM, Sebastian Raschka <se.rasc...@gmail.com> wrote:
>> >>> Hi, Raga,
>> >>>
>> >>> I think that if GridSearchCV is used for classification, the stratified k-fold doesn't do shuffling by default.
>> >>>
>> >>> Say you do 20 grid search repetitions, you could then do something like:
>> >>>
>> >>> from sklearn.model_selection import StratifiedKFold
>> >>>
>> >>> for i in range(n_reps):
>> >>>     k_fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=i)
>> >>>     gs = GridSearchCV(..., cv=k_fold)
>> >>>     ...
>> >>>
>> >>> Best,
>> >>> Sebastian
>> >>>
>> >>>> On Jan 26, 2017, at 5:39 PM, Raga Markely <raga.mark...@gmail.com> wrote:
>> >>>>
>> >>>> Hello,
>> >>>>
>> >>>> I was trying to do repeated Grid Search CV (20 repeats). I thought that each time I call GridSearchCV, the way the data are split into training and test folds would be different.
>> >>>>
>> >>>> However, I got the same best_params_ and best_score_ for all 20 repeats. It looks like the data are split into identical folds in each run? Just to clarify, suppose I have the following data: 0,1,2,3,4, with Class 1 = [0,1,2] and Class 2 = [3,4], and I call cv=2. The split is always, for instance, [0,3] [1,2,4] in each repeat, and I couldn't get [1,3] [0,2,4] or other combinations.
>> >>>>
>> >>>> If I understand correctly, GridSearchCV uses StratifiedKFold when I enter cv = integer. The StratifiedKFold command has a random state; I wonder if there is any way I can make the training and test sets randomly separated each time I call GridSearchCV?
>> >>>>
>> >>>> Just a note, I used the following classifiers: Logistic Regression, KNN, SVC, Kernel SVC, Random Forest, and had the same observation regardless of the classifier.
>> >>>>
>> >>>> Thank you very much!
>> >>>> Raga
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn