Hi all,

My question is partly technical and partly about ML best practice. I am running (Randomized/)GridSearchCV to tune the hyperparameters of my estimator. However, if I then want to do model selection, it is best to use nested cross-validation to get a less biased performance estimate and avoid overoptimistic score reporting, as discussed in these papers:
1) Cawley, G. C., and N. L. C. Talbot. "On over-fitting in model selection and subsequent selection bias in performance evaluation." Journal of Machine Learning Research 11 (2010): 2079-2107.
2) Varma, Sudhir, and Richard Simon. "Bias in error estimation when using cross-validation for model selection." BMC Bioinformatics 7.1 (2006): 91.

Luckily, sklearn lets me do this via cross_val_score, as described here:
http://scikit-learn.org/stable/tutorial/statistical_inference/model_selection.html
But the documentation is a little thin, and I want to make sure that I am doing this correctly. Here is a simple runnable example that (afaict) does this straightaway:
______________________________________________
import numpy as np
import sklearn
from sklearn.grid_search import RandomizedSearchCV
from sklearn.datasets import load_digits
from sklearn.cross_validation import cross_val_score
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# get some data
digits = load_digits()
X, y = digits.data, digits.target

param_dist = {
    'rbf_svm__C': [1, 10, 100, 1000],
    'rbf_svm__gamma': [0.001, 0.0001],
    'rbf_svm__kernel': ['rbf', 'linear'],
}

steps = [('scaler', StandardScaler()), ('rbf_svm', SVC())]
pipeline = Pipeline(steps)

# inner loop: hyperparameter search; outer loop: cross_val_score
search = RandomizedSearchCV(pipeline, param_distributions=param_dist, n_iter=5)
cross_val_score(search, X, y)
______________________________________________
Now this is all well and good. HOWEVER, when I want to be more specific about the cross-validation procedure, I can set cv=sklearn.cross_validation.KFold(len(X), 10) and pass it to both RandomizedSearchCV AND cross_val_score. But if I do this, I often get errors that look like this:

/Library/Python/2.7/site-packages/sklearn/utils/__init__.pyc in safe_indexing(X, indices)
    155             indices.dtype.kind == 'i'):
    156         # This is often substantially faster than X[indices]
--> 157         return X.take(indices, axis=0)
    158     else:
    159         return X[indices]

IndexError: index 1617 is out of bounds for size 1617

After thinking about it, this actually makes sense: the first argument to KFold (the number of samples, n) should be different between the inner CV and the outer CV when they are nested. If I split my data into k=10 folds in the outer CV, then the inner CV only ever sees training data the size of 9 of the outer folds. Concretely, the digits data has len(X) = 1797 samples, so each outer training split has about 1617 of them, while an inner KFold built with n=1797 generates indices up to 1796 — hence "index 1617 is out of bounds for size 1617". Is this logical?

It turns out that if I assume this and size the inner KFold to 9/10 of the original training data, my hypothesis seems correct and the nested CV runs like a charm. You can test it yourself by setting the cv arguments of RandomizedSearchCV and cross_val_score, respectively, in the example above to:

cv=sklearn.cross_validation.KFold(min([len(a) for a, b in sklearn.cross_validation.KFold(len(X), 10)]), 10)
cv=sklearn.cross_validation.KFold(len(X), 10)

Note that the inner CV is sized to the smallest outer training split, to cover the case where len(X) is not evenly divisible by k=10. This probably leaves out a few data points, but it is the best I can do without crashing the program with the above error (since the n argument of KFold apparently cannot be set dynamically). This seems messy, and it may not be the best way to go about it. My question is: is there a better way of accomplishing this if I want to do nested 10-fold cross-validation using cross_val_score with a RandomizedSearchCV pipeline?
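One idea I have been toying with (a minimal sketch, reusing pipeline, param_dist, X, y from the snippet above, and assuming I am right that an integer cv is only resolved into actual folds inside fit(), against whatever training data fit() receives): give the inner search a plain integer and reserve the explicit KFold for the outer loop, where the sample size is known up front. Would this be the recommended way?
______________________________________________
from sklearn.cross_validation import KFold, cross_val_score
from sklearn.grid_search import RandomizedSearchCV

# Inner CV: pass a plain integer. Afaict it is turned into a
# (Stratified)KFold at fit time, sized to the training split that
# fit() actually receives, so its indices never run past the outer
# training fold.
search = RandomizedSearchCV(pipeline, param_distributions=param_dist,
                            n_iter=5, cv=10)

# Outer CV: here the total sample size is known up front, so an
# explicit KFold over all len(X) samples is fine.
scores = cross_val_score(search, X, y, cv=KFold(len(X), n_folds=10))
print(scores.mean(), scores.std())
______________________________________________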
Another question: after I get the relevant unbiased scores to report, if I then want the best classifier, would I have to go back and fit my full dataset with the RandomizedSearchCV (i.e. using the inner KFold object from its initialization)? It seems best_estimator_ is only available after I fit the RandomizedSearchCV myself, even if I have already called cross_val_score...

kind regards,
Philip
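P.S. For the second question, here is the pattern I currently have in mind — a minimal sketch, assuming I am right that cross_val_score fits clones of the search internally and discards them, so the search has to be refit on the full dataset afterwards to expose best_estimator_:
______________________________________________
# Step 1: report the (nearly) unbiased nested-CV estimate. The clones
# fitted inside cross_val_score are thrown away afterwards.
scores = cross_val_score(search, X, y, cv=KFold(len(X), n_folds=10))
print("nested CV accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

# Step 2: refit the search on ALL the data to obtain the final model.
# This does not invalidate the estimate above; it just produces the
# classifier to actually use.
search.fit(X, y)
final_model = search.best_estimator_
______________________________________________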