Hi all,

My question is partly technical and partly about ML best practice. I am
performing (Randomized/)GridSearchCV to 'optimize' the hyperparameters
of my estimator. However, if I want to do model selection after this,
it would be best to do nested cross-validation to get a less biased
performance estimate and avoid issues like overoptimistic score
reporting, as discussed in these papers:

1) G. C. Cawley and N. L. C. Talbot, "Over-fitting in model selection
and subsequent selection bias in performance evaluation," Journal of
Machine Learning Research, vol. 11, pp. 2079-2107, July 2010.
2) S. Varma and R. Simon, "Bias in error estimation when using
cross-validation for model selection," BMC Bioinformatics, vol. 7,
no. 1, p. 91, 2006.

Luckily, sklearn allows me to do this via cross_val_score, as
described here:
http://scikit-learn.org/stable/tutorial/statistical_inference/model_selection.html

But the documentation is a little thin and I want to make sure I am
doing this correctly. Here is a minimal, runnable example that does
this straightaway (afaict):
______________________________________________
import numpy as np
import sklearn
from sklearn.grid_search import RandomizedSearchCV
from sklearn.datasets import load_digits
from sklearn.cross_validation import cross_val_score
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# get some data
digits = load_digits()
X, y = digits.data, digits.target

# candidate hyperparameters for the inner (model selection) loop
param_dist = {
    'rbf_svm__C': [1, 10, 100, 1000],
    'rbf_svm__gamma': [0.001, 0.0001],
    'rbf_svm__kernel': ['rbf', 'linear'],
}

# scale the features, then fit an SVM
steps = [('scaler', StandardScaler()), ('rbf_svm', SVC())]
pipeline = Pipeline(steps)

# inner loop: randomized search over the pipeline's hyperparameters
search = RandomizedSearchCV(pipeline,
                            param_distributions=param_dist, n_iter=5)

# outer loop: treat the whole search as the estimator -> nested CV
cross_val_score(search, X, y)
______________________________________________

Now this is all well and good. HOWEVER, when I want to be more
specific about which cross-validation procedure to run, I can set
cv=sklearn.cross_validation.KFold(len(X), 10) and pass this both to
RandomizedSearchCV AND to cross_val_score, as in the sketch below.
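
Concretely, the setup I mean looks something like this (a sketch
reusing pipeline and param_dist from the example above):
______________________________________________
# pass an explicit KFold to both the inner search and the outer loop;
# both are (naively) built over the full length of X
inner_cv = sklearn.cross_validation.KFold(len(X), 10)
outer_cv = sklearn.cross_validation.KFold(len(X), 10)

search = RandomizedSearchCV(pipeline, param_distributions=param_dist,
                            n_iter=5, cv=inner_cv)
cross_val_score(search, X, y, cv=outer_cv)
______________________________________________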

But if I do this, I often get errors that look like this:

/Library/Python/2.7/site-packages/sklearn/utils/__init__.pyc in
safe_indexing(X, indices)
    155                                    indices.dtype.kind == 'i'):
    156             # This is often substantially faster than X[indices]
--> 157             return X.take(indices, axis=0)
    158         else:
    159             return X[indices]

IndexError: index 1617 is out of bounds for size 1617

This actually makes sense to me after thinking about it, because the
first argument to KFold (the number of samples) should differ between
the inner CV and the outer CV when they are nested. For example, if I
split my data into k=10 folds in the outer CV, then the inner CV only
ever sees training data the size of 9 of the outer CV folds. Is this
logical?
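
The numbers in the traceback seem consistent with this; a quick check
of the outer training-split sizes (the digits data has len(X) == 1797)
shows roughly:
______________________________________________
# how big is the training portion of each outer fold?
outer_cv = sklearn.cross_validation.KFold(len(X), 10)
print([len(train) for train, test in outer_cv])
# -> splits of 1617 and 1618 samples (out of 1797 total), so an inner
# KFold built with n=1797 will request indices up to 1796 on data
# that only has ~1617 rows, hence the IndexError above
______________________________________________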

It turns out that if I assume this and test the boundary condition of
sizing the inner CV to 9/10 of the original training data, my
hypothesis seems correct and the nested CV runs like a charm. You can
test it yourself by setting the cv arguments of RandomizedSearchCV and
cross_val_score, respectively, to:
cv=sklearn.cross_validation.KFold(min([len(train) for train, test in
sklearn.cross_validation.KFold(len(X), 10)]), 10)
cv=sklearn.cross_validation.KFold(len(X), 10)

Note that the inner CV is sized to the smallest training split of the
outer CV, to cover the case where len(X) is not evenly divisible by
k=10. This probably leaves out a few data points in some inner folds,
but it is the best I can do without crashing the program with the
above error message (since it seems the 'n' argument of KFold cannot
be set dynamically for each outer fold).
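
Put together, the full workaround I am running looks something like
this (a sketch, again reusing pipeline and param_dist from above):
______________________________________________
outer_cv = sklearn.cross_validation.KFold(len(X), 10)

# size the inner CV to the smallest outer training split
n_inner = min([len(train) for train, test in outer_cv])
inner_cv = sklearn.cross_validation.KFold(n_inner, 10)

search = RandomizedSearchCV(pipeline, param_distributions=param_dist,
                            n_iter=5, cv=inner_cv)

# outer loop: performance estimate of the whole selection procedure
scores = cross_val_score(search, X, y, cv=outer_cv)
print(scores)
______________________________________________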

This seems messy, and may not be the best way to go about it. My first
question is: is there a better way of doing nested 10-fold cross
validation using cross_val_score with a RandomizedSearchCV pipeline?
A second question: after I get the relevant unbiased scores to report,
if I want the best classifier, do I then have to go back and fit the
RandomizedSearchCV on my full dataset, using the second (full-size)
KFold object above as its cv? It seems best_estimator_ is only
available after I fit the RandomizedSearchCV myself, even if I have
already called cross_val_score...
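
In other words, is something like the following (a sketch of what I
imagine, not sure it is the intended workflow) the way to get the
final model?
______________________________________________
# refit the search on the full dataset just to obtain best_estimator_,
# this time with the full-size KFold as its cv
final_search = RandomizedSearchCV(pipeline, param_distributions=param_dist,
                                  n_iter=5,
                                  cv=sklearn.cross_validation.KFold(len(X), 10))
final_search.fit(X, y)
clf = final_search.best_estimator_
______________________________________________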

kind regards,
Philip
