Sebastian,

Many thanks - your suggestion matches my intuition, and this is how I'll proceed from here!
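Concretely, here is roughly what I intend to run -- a sketch only, combining your separate inner/outer cv objects with the min() sizing from my original mail, and not yet re-tested end-to-end in exactly this form:
______________________________________________
from sklearn.cross_validation import KFold, cross_val_score
from sklearn.datasets import load_digits
from sklearn.grid_search import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

digits = load_digits()
X, y = digits.data, digits.target

param_dist = {
    'rbf_svm__C': [1, 10, 100, 1000],
    'rbf_svm__gamma': [0.001, 0.0001],
    'rbf_svm__kernel': ['rbf', 'linear'],
}
pipeline = Pipeline([('scaler', StandardScaler()), ('rbf_svm', SVC())])

# outer CV over the full data set
outer_cv = KFold(len(X), 10)

# inner CV sized to the smallest outer training fold, so it never
# indexes past the subset it actually receives
n_inner = min(len(train) for train, test in KFold(len(X), 10))
inner_cv = KFold(n_inner, 10)

search = RandomizedSearchCV(pipeline, param_distributions=param_dist,
                            n_iter=5, cv=inner_cv)

# unbiased performance estimate to report, from the outer loop
scores = cross_val_score(search, X, y, cv=outer_cv)

# final classifier: tune once more on the full data set and keep
# best_estimator_ (refit=True by default, so the best parameter setting
# is retrained on all of X)
final_search = RandomizedSearchCV(pipeline, param_distributions=param_dist,
                                  n_iter=5, cv=KFold(len(X), 10))
final_search.fit(X, y)
classifier = final_search.best_estimator_
______________________________________________
If there turns out to be a cleaner way to size the inner KFold, I'm of course still interested.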
best wishes,
Philip

On Monday, September 28, 2015, Sebastian Raschka <se.rasc...@gmail.com> wrote:
> Hi, Philip,
>
> > (Randomized/)GridSearchCV to 'optimize' the hyperparameters
> > of my estimator. However, if I want to do model selection after this,
>
> Essentially, the hyperparameter tuning is already your model selection step, since you couple the (Randomized/)GridSearchCV with some performance metric. So, let's say that via GridSearch you find that the inverse regularization parameter C=0.1 and an RBF kernel width of gamma=100 give you the best score, e.g., ROC AUC. If you now "use" those C and gamma values, you have effectively selected your model, which you can then further evaluate on an independent test set (if you have kept one).
>
> Now, if you are interested in comparing different learning algorithms, e.g., tree-based methods, linear models, and kernel SVMs, then I'd definitely recommend using nested cross-validation, for example, as you already did:
>
> > search = RandomizedSearchCV(pipeline,
> >     param_distributions=param_dist, n_iter=5)
> >
> > cross_val_score(search, X, y)
>
> I am not sure why you encounter this error in your second example -- I'd have to think about it more -- but I suspect it comes from reusing the same cv object for both loops. Maybe try to initialize 2 separate cross-validation objects instead of passing
>
> > cv=sklearn.cross_validation.KFold(len(X), 10)
>
> to both. Let's say you have 100 training points. In the outer loop, you split them into 10 folds and pass 9 folds on to the inner loop. So your inner loop effectively has only 90 training samples, which is why "len(X)" in
>
> > cv=sklearn.cross_validation.KFold(len(X), 10)
>
> is not correct for the inner loop anymore. Maybe try
>
> cv_inner=sklearn.cross_validation.KFold(len(X) - len(X)/10, 10)
>
> Reading further down your email, it sounds like this is what you have done and it worked?
>
> > Another question is that after I get the
> > relevant unbiased scores to report, if I want to get the best
> > classifier would I then have to go back and fit my full dataset using
> > the second KFold object in the initialization of RandomizedSearchCV?
>
> So, if you do nested cross-validation, a few things can happen... For example, let's say you tuned and evaluated an RBF kernel SVM with respect to C and gamma. For simplicity, let's talk about a 100-sample training set with 10 inner and 10 outer folds. In the inner loop, you tune your model via GridSearch & cross-validation on the 90 training samples. Let's say you find that gamma=0.1 and C=10 work "best". Next, this model is evaluated on the 10 remaining validation samples of your outer loop. You keep this validation score and advance to the next outer-loop fold. Again, you pass 90 samples -- these are different now -- to the inner loop. If your model is stable, you may find that gamma=0.1 and C=10 also give you the "best" inner CV results. Then you evaluate this model, tuned in the inner fold, on the hold-out data (now also different) of the outer loop. If your model is unstable, though, you may get different values for gamma and C in the inner loop, for example gamma=1.0 and C=100. After you have repeated this 10 times, you have 10 validation scores from the outer loop that you can average to get a (relatively) unbiased estimate of your model's performance. However, you may also have different models associated with each validation score.
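> In code, the procedure I am describing would look roughly like this (rough, untested sketch; pipeline, param_dist, X, y as in your example further down):
>
> from sklearn.cross_validation import KFold
> from sklearn.grid_search import GridSearchCV
>
> outer_cv = KFold(len(X), 10)
> outer_scores, best_params_per_fold = [], []
>
> for train_idx, test_idx in outer_cv:
>     # inner loop: tune on the ~90% training portion only;
>     # an integer cv builds its folds on whatever subset it receives
>     inner_search = GridSearchCV(pipeline, param_grid=param_dist, cv=10)
>     inner_search.fit(X[train_idx], y[train_idx])
>
>     # outer loop: evaluate the tuned model on the held-out ~10%
>     outer_scores.append(inner_search.score(X[test_idx], y[test_idx]))
>     best_params_per_fold.append(inner_search.best_params_)
>
> # the average of outer_scores is the (relatively) unbiased estimate;
> # best_params_per_fold may differ from fold to fold if the model is unstable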
> In practice, you would repeat this nested CV for different algorithms you'd like to compare and select the model & algorithm that gives you the "best" unbiased estimate (the average of the outer-loop validation scores). After that, you select this "best" learning algorithm and tune it again via "regular" cross-validation to find good hyperparameters. If you want to use your algorithm for some sort of real-world application, you may also want to train it (without further tuning) on all your available data after all the evaluation is done.
>
> Best,
> Sebastian
>
>
> > On Sep 27, 2015, at 4:20 PM, Philip Tully <tu...@csc.kth.se> wrote:
> >
> > Hi all,
> >
> > My question is mostly technical, but partly about ML best practice. I am performing (Randomized/)GridSearchCV to 'optimize' the hyperparameters of my estimator. However, if I want to do model selection after this, it would be best to do nested cross-validation to get a less biased estimate and avoid issues like overoptimistic score reporting, as discussed in these papers:
> >
> > 1) G. C. Cawley and N. L. C. Talbot, "Over-fitting in model selection and subsequent selection bias in performance evaluation," Journal of Machine Learning Research, vol. 11, pp. 2079-2107, July 2010.
> > 2) S. Varma and R. Simon, "Bias in error estimation when using cross-validation for model selection," BMC Bioinformatics, vol. 7, no. 1, p. 91, 2006.
> >
> > Luckily, sklearn allows me to do this via cross_val_score, as described here:
> > http://scikit-learn.org/stable/tutorial/statistical_inference/model_selection.html
> >
> > But the documentation is a little thin and I want to make sure that I am doing this correctly. Here is simple running code that does this straightaway (afaict):
> > ______________________________________________
> > import numpy as np
> > import sklearn
> > from sklearn.grid_search import RandomizedSearchCV
> > from sklearn.datasets import load_digits
> > from sklearn.cross_validation import cross_val_score
> > from sklearn.svm import SVC
> > from sklearn.preprocessing import StandardScaler
> > from sklearn.pipeline import Pipeline
> >
> > # get some data
> > digits = load_digits()
> > X, y = digits.data, digits.target
> >
> > param_dist = {
> >     'rbf_svm__C': [1, 10, 100, 1000],
> >     'rbf_svm__gamma': [0.001, 0.0001],
> >     'rbf_svm__kernel': ['rbf', 'linear'],
> > }
> >
> > steps = [('scaler', StandardScaler()), ('rbf_svm', SVC())]
> > pipeline = Pipeline(steps)
> >
> > search = RandomizedSearchCV(pipeline,
> >     param_distributions=param_dist, n_iter=5)
> >
> > cross_val_score(search, X, y)
> > ______________________________________________
> >
> > Now this is all well and good. HOWEVER, when I want to be more specific about what kind of cross-validation procedure I want to run, I can set cv=sklearn.cross_validation.KFold(len(X), 10) and pass this both to RandomizedSearchCV AND cross_val_score.
> > But if I do this, I often get errors that look like this:
> >
> > /Library/Python/2.7/site-packages/sklearn/utils/__init__.pyc in safe_indexing(X, indices)
> >     155                 indices.dtype.kind == 'i'):
> >     156             # This is often substantially faster than X[indices]
> > --> 157             return X.take(indices, axis=0)
> >     158         else:
> >     159             return X[indices]
> >
> > IndexError: index 1617 is out of bounds for size 1617
> >
> > This actually makes sense to me after thinking about it, because the first argument to KFold should be different between the inner CV and the outer CV when they are nested. For example, if I split my data into k=10 folds in the outer CV, then the inner CV should use training data that is only the size of 9 of the outer CV folds. Is this logical?
> >
> > It turns out that if I assume this and test the boundary conditions for 9/10 of the original training data, my hypothesis seems correct and the nested CV runs like a charm. You can test it yourself if you set the cv arguments of RandomizedSearchCV and cross_val_score above to, respectively:
> >
> > cv=sklearn.cross_validation.KFold(min([len(a) for a, b in sklearn.cross_validation.KFold(len(X), 10)]), 10)
> > cv=sklearn.cross_validation.KFold(len(X), 10)
> >
> > Note that the inner CV is sized to the smallest training fold, to handle the case where len(X) is not evenly divisible by k=10. This probably leaves out a few data points, but it is the best I can do without crashing the program with the above error message (since it seems the 'n' argument of KFold cannot be set dynamically).
> >
> > This seems messy, and may not be the best way to go about doing this. My question is: is there a better way of accomplishing this if I want to do nested 10-fold cross-validation using cross_val_score with a RandomizedSearchCV pipeline? Another question: after I get the relevant unbiased scores to report, if I want to get the best classifier, would I then have to go back and fit my full dataset using the second KFold object in the initialization of RandomizedSearchCV? It seems best_estimator_ is only available after I fit the RandomizedSearchCV, even if I have already called cross_val_score...
> >
> > kind regards,
> > Philip

--
Sent from my iPhone