Hi Sebastian,

Sorry, I used the wrong terms (I was referring to the algo as the model). Great then, I think what I have is aligned with your workflow.
Thank you very much for your help! Have a good weekend,
Raga

On Fri, Jan 27, 2017 at 1:01 PM, Sebastian Raschka <[email protected]> wrote:
> Hi, Raga,
>
> Sounds good, but I am wondering a bit about the order. 2) should come before 1), right? Because model selection is basically done via hyperparameter optimization.
>
> Not saying that this is the optimal/right approach, but I usually do it like this:
>
> 1.) algo selection via nested CV
> 2.) model selection based on the best algo via k-fold on the whole training set
> 3.) fit the best algo with the best hyperparams (from 2.) to the whole training set
> 4.) evaluate on the test set
> 5.) fit the classifier to the whole dataset, done
>
> Best,
> Sebastian
>
> > On Jan 27, 2017, at 12:49 PM, Sebastian Raschka <[email protected]> wrote:
> >
> >> On Jan 27, 2017, at 10:23 AM, Raga Markely <[email protected]> wrote:
> >>
> >> Sounds good, Sebastian. Thanks for the suggestions.
> >>
> >> My dataset is relatively small (only ~35 samples), and this is the workflow I have set up so far:
> >> 1. Model selection: use a nested loop via cross_val_score(GridSearchCV(...), ...), same as shown in the scikit-learn page that you provided. The results show no statistically significant difference in accuracy mean +/- SD among the classifiers; this is expected, as the pattern is pretty obvious and simple to separate by eye after dimensionality reduction (I use a pipeline of StandardScaler, LDA, and a classifier), so I take all of them and use a voting classifier in step #3.
> >> 2. Hyperparameter optimization: use GridSearchCV to optimize the hyperparameters of each classifier.
> >> 3. Decision region: use the hyperparameters from step #2, fit each classifier separately to the whole dataset, and use a voting classifier to get the decision region.
> >>
> >> Does this sound reasonable?
> >>
> >> Thank you very much!
> >> Raga
> >>
> >> On Thu, Jan 26, 2017 at 8:31 PM, Sebastian Raschka <[email protected]> wrote:
> >> You are welcome! And in addition, if you select among different algorithms, here are some more suggestions:
> >>
> >> a) Don't do it based on your independent test set if this is going to be your final model performance estimate, or be aware that it would be overly optimistic.
> >> b) Also, it's not the best idea to select algorithms using cross-validation on the same training set that you used for model selection; a more robust way would be nested CV (e.g., http://scikit-learn.org/stable/auto_examples/model_selection/plot_nested_cross_validation_iris.html).
> >>
> >> But yeah, it all depends on your dataset and size. If you have a neural net that takes weeks to train, and if you have a large dataset anyway so that you can set aside large sets for testing, I'd train on train/validation splits and evaluate on the test set. And to compare, e.g., two networks against each other on large test sets, you could do a McNemar test.
> >>
> >> Best,
> >> Sebastian
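For concreteness, here is a minimal sketch of the workflow discussed in the messages above: nested CV (cross_val_score wrapped around GridSearchCV) to compare algorithms, each as a StandardScaler/LDA/classifier pipeline, then hyperparameter tuning on the full data and a voting classifier at the end. The dataset, candidate classifiers, and parameter grids below are placeholders chosen for illustration, not anything specified in the thread:

    from sklearn.datasets import load_iris
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.ensemble import VotingClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)  # placeholder dataset

    # Candidate algorithms, each wrapped in a StandardScaler -> LDA -> classifier
    # pipeline; the parameter grids are illustrative only.
    candidates = {
        'logreg': (LogisticRegression(max_iter=1000), {'clf__C': [0.1, 1.0, 10.0]}),
        'svc': (SVC(), {'clf__C': [0.1, 1.0, 10.0], 'clf__gamma': [0.01, 0.1, 1.0]}),
    }

    inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

    fitted = []
    for name, (clf, grid) in candidates.items():
        pipe = Pipeline([('scale', StandardScaler()),
                         ('lda', LinearDiscriminantAnalysis()),
                         ('clf', clf)])
        gs = GridSearchCV(pipe, grid, cv=inner_cv)

        # Nested CV: the outer loop estimates the performance of the whole
        # "tune-this-algorithm-with-GridSearchCV" procedure.
        scores = cross_val_score(gs, X, y, cv=outer_cv)
        print('%s: %.3f +/- %.3f' % (name, scores.mean(), scores.std()))

        # Hyperparameter optimization for this algorithm on the full dataset.
        gs.fit(X, y)
        fitted.append((name, gs.best_estimator_))

    # If the algorithms are statistically indistinguishable, combine them in a
    # majority-vote ensemble fit on the whole dataset.
    voter = VotingClassifier(estimators=fitted, voting='hard')
    voter.fit(X, y)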
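The McNemar test mentioned at the end of the last message can also be sketched in a few lines. This is the continuity-corrected chi-squared variant written from scratch with scipy, applied to made-up predictions purely for illustration:

    import numpy as np
    from scipy.stats import chi2

    def mcnemar_test(y_true, pred_a, pred_b):
        """McNemar test (continuity-corrected) for two classifiers evaluated
        on the same test set; returns the test statistic and p-value."""
        y_true = np.asarray(y_true)
        a_ok = np.asarray(pred_a) == y_true
        b_ok = np.asarray(pred_b) == y_true
        # Off-diagonal counts of the 2x2 agreement table:
        # n01 = A correct / B wrong, n10 = A wrong / B correct.
        n01 = np.sum(a_ok & ~b_ok)
        n10 = np.sum(~a_ok & b_ok)
        stat = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
        return stat, chi2.sf(stat, df=1)

    # Made-up labels and predictions, purely for illustration.
    y_true = [0, 0, 0, 1, 1, 1, 1, 0, 1, 0]
    pred_a = [0, 0, 1, 1, 1, 1, 0, 0, 1, 0]
    pred_b = [0, 1, 1, 1, 0, 1, 0, 1, 1, 1]
    print(mcnemar_test(y_true, pred_a, pred_b))

For very small disagreement counts, the exact binomial form of the test is usually preferred over this chi-squared approximation.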
> >>> On Jan 26, 2017, at 8:09 PM, Raga Markely <[email protected]> wrote:
> >>>
> >>> Ahh, nice. I will use that. Thanks a lot, Sebastian!
> >>>
> >>> Best,
> >>> Raga
> >>>
> >>> On Thu, Jan 26, 2017 at 6:34 PM, Sebastian Raschka <[email protected]> wrote:
> >>> Hi, Raga,
> >>>
> >>> I think that if GridSearchCV is used for classification, the stratified k-fold doesn't do shuffling by default.
> >>>
> >>> Say you do 20 grid search repetitions; you could then do something like:
> >>>
> >>> from sklearn.model_selection import StratifiedKFold
> >>>
> >>> for i in range(n_reps):
> >>>     k_fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=i)
> >>>     gs = GridSearchCV(..., cv=k_fold)
> >>>     ...
> >>>
> >>> Best,
> >>> Sebastian
> >>>
> >>>> On Jan 26, 2017, at 5:39 PM, Raga Markely <[email protected]> wrote:
> >>>>
> >>>> Hello,
> >>>>
> >>>> I was trying to do repeated grid search CV (20 repeats). I thought that each time I call GridSearchCV, the training and test sets would be split differently.
> >>>>
> >>>> However, I got the same best_params_ and best_score_ for all 20 repeats. It looks like the training and test sets are split into identical folds in each run. Just to clarify, say I have the following data: 0,1,2,3,4, with class 1 = [0,1,2] and class 2 = [3,4], and I call cv = 2. The split is always, for instance, [0,3] [1,2,4] in each repeat, and I couldn't get [1,3] [0,2,4] or other combinations.
> >>>>
> >>>> If I understand correctly, GridSearchCV uses StratifiedKFold when I enter cv = integer. The StratifiedKFold command has a random state; I wonder if there is any way I can make the training and test sets randomly split each time I call GridSearchCV?
> >>>>
> >>>> Just a note: I used the following classifiers: Logistic Regression, KNN, SVC, Kernel SVC, and Random Forest, and had the same observation regardless of the classifier.
> >>>>
> >>>> Thank you very much!
> >>>> Raga
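Filling in the placeholders from the snippet quoted above, a runnable sketch of the repeated grid search might look like the following; the SVC, parameter grid, dataset, and fold count are arbitrary choices for illustration:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import GridSearchCV, StratifiedKFold
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)      # placeholder dataset
    param_grid = {'C': [0.1, 1.0, 10.0],   # placeholder grid
                  'gamma': [0.01, 0.1, 1.0]}

    n_reps = 20
    for i in range(n_reps):
        # A different random_state per repetition shuffles the samples before
        # the stratified split, so the folds (and hence best_params_ and
        # best_score_) can differ between repetitions.
        k_fold = StratifiedKFold(n_splits=5, shuffle=True, random_state=i)
        gs = GridSearchCV(SVC(), param_grid, cv=k_fold)
        gs.fit(X, y)
        print(i, gs.best_params_, round(gs.best_score_, 3))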
_______________________________________________
scikit-learn mailing list
[email protected]
https://mail.python.org/mailman/listinfo/scikit-learn
