Hi, Philip,

>  (Randomized/)GridSearchCV to 'optimize' the hyperparameters
> of my estimator. However, if I want to do model selection after this,

Essentially, the hyperparameter tuning is already your model selection step, 
since you couple the (Randomized/)GridSearchCV with some performance metric. 
So, let's say via GridSearch you find that an inverse regularization parameter 
C=0.1 and an RBF kernel width of gamma=100 give you the best score, e.g., the 
best ROC AUC. If you now "use" those C and gamma values, you have effectively 
selected your model, which you can then further evaluate on an independent test 
set (if you have kept one).
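
Just to illustrate what I mean, here is a rough sketch (using the digits data and 
plain accuracy instead of ROC AUC, only to keep it self-contained and runnable 
against the sklearn.grid_search / sklearn.cross_validation modules you are using; 
the parameter values are arbitrary):

from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVC
from sklearn.datasets import load_digits

digits = load_digits()
X, y = digits.data, digits.target

# keep an independent test set for the final evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1)

# tuning C and gamma via cross-validation on the training set
# *is* the model selection step
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [0.0001, 0.001, 0.01, 0.1]}
gs = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=10)
gs.fit(X_train, y_train)

# the selected model, evaluated once on the untouched test set
print(gs.best_params_)
print(gs.best_estimator_.score(X_test, y_test))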

Now, if you are interested in comparing different learning algorithms, e.g., 
tree-based methods, linear models, and kernel SVMs, then I'd definitely recommend 
using nested cross-validation, for example, as you already did:

> search = RandomizedSearchCV(pipeline,
> param_distributions=param_dist, n_iter=5)
> 
> cross_val_score(search, X, y)

I am not sure why you encounter this error in your second example; I'd have to 
think about it more, but I suspect it comes from passing the same cross-validation 
object, i.e.,

> cv=sklearn.cross_validation.KFold(len(X), 10)

to both the inner and the outer loop. Maybe try to initialize 2 separate 
cross-validation objects instead.

Let's say you have 100 training points. In the outer loop, you split them into 10 
folds, then you pass 9 folds to the inner loop. So, your inner loop effectively 
has 90 training samples, which is why, for the inner loop, "len(X)" in

> cv=sklearn.cross_validation.KFold(len(X), 10)


no longer matches the data the inner loop actually sees. Maybe try 

> cv_inner=sklearn.cross_validation.KFold(len(X) - len(X)/10, 10)


Reading further down your email, it sounds like this is roughly what you have 
done (sizing the inner KFold by the smallest outer training fold, which is even 
safer when len(X) is not evenly divisible by 10), and it worked?
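
In case it helps, here is how I would wire the two cross-validation objects 
together, using your trick of sizing the inner KFold by the smallest outer 
training fold (just a rough sketch along the lines of your code, not something 
I have benchmarked):

from sklearn.cross_validation import KFold, cross_val_score
from sklearn.grid_search import RandomizedSearchCV
from sklearn.datasets import load_digits
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

digits = load_digits()
X, y = digits.data, digits.target

param_dist = {'rbf_svm__C': [1, 10, 100, 1000],
              'rbf_svm__gamma': [0.001, 0.0001],
              'rbf_svm__kernel': ['rbf', 'linear']}

pipeline = Pipeline([('scaler', StandardScaler()), ('rbf_svm', SVC())])

# the outer CV sees the full data set ...
cv_outer = KFold(len(X), 10)

# ... while the inner CV only ever sees an outer training split; sizing it by
# the smallest training split avoids out-of-bounds indices (a few samples may
# be skipped when len(X) is not evenly divisible by 10, as you noted)
n_inner = min(len(train_idx) for train_idx, _ in cv_outer)
cv_inner = KFold(n_inner, 10)

search = RandomizedSearchCV(pipeline, param_distributions=param_dist,
                            n_iter=5, cv=cv_inner)

print(cross_val_score(search, X, y, cv=cv_outer).mean())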

>  Another question is that after I get the
> relevant unbiased scores to report, if I want to get the best
> classifier would I then have to go back and fit my full dataset using
> the second KFold object in the initialization of RandomizedSearchCV?

So, if you do nested cross-validation, a few things can happen... For example, 
let's say you tuned and evaluated an RBF kernel SVM with respect to C and 
gamma. For simplicity, let's talk about a 100-sample training set with 10 inner 
and 10 outer folds. In the inner loop, you tune your model via GridSearch & 
cross-validation on the 90 training samples. Let's say you find that a model 
with gamma=0.1 and C=10 works "best". Next, this model is evaluated on the 10 
remaining validation samples of your outer loop. You keep this validation score 
and advance to the next outer-loop fold. Again, you pass 90 samples -- these 
are different now -- to the inner loop. If your model is stable, you may find 
that gamma=0.1 and C=10 also give you the "best" inner CV results. Then you 
evaluate this model, tuned in the inner loop, on the hold-out data (now also 
different) of the outer loop. If your model is unstable, though, you may get 
different values for gamma and C in the inner loop, for example gamma=1.0 and 
C=100. After you have repeated this 10 times, you have 10 validation scores from 
the outer loop that you can average to get a (relatively) unbiased estimate of 
your model's performance. However, you may also have different models 
associated with each validation score. 
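
Written out by hand, the procedure above would look roughly like this (a sketch 
with GridSearchCV as the inner loop and the digits data as a stand-in; the 
parameter grid is just a placeholder):

import numpy as np
from sklearn.cross_validation import KFold
from sklearn.grid_search import GridSearchCV
from sklearn.datasets import load_digits
from sklearn.svm import SVC

digits = load_digits()
X, y = digits.data, digits.target

param_grid = {'C': [1, 10, 100], 'gamma': [0.01, 0.001, 0.0001]}

outer_scores = []
for train_idx, valid_idx in KFold(len(X), 10):       # outer loop
    X_train, y_train = X[train_idx], y[train_idx]
    X_valid, y_valid = X[valid_idx], y[valid_idx]

    # inner loop: tune the hyperparameters on the training folds only
    gs = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=10)
    gs.fit(X_train, y_train)
    print(gs.best_params_)   # may differ from fold to fold if the model is unstable

    # evaluate the tuned model on the held-out data of the outer loop
    outer_scores.append(gs.best_estimator_.score(X_valid, y_valid))

# (relatively) unbiased estimate of the model's performance
print(np.mean(outer_scores))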

In practice, you would repeat this nested CV for different algorithms you'd 
like to compare and select the model & algorithm that gives you the "best" 
unbiased estimate (average of the outer-loop validation scores). After that, 
you take this "best" learning algorithm and tune it again via "regular" 
cross-validation to find good hyperparameters. If you want to use your algorithm 
for some sort of real-world application, you may also want to train it (without 
further tuning) on all your available data after all the evaluation is done.
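
For that last step, something along these lines would do (again just a sketch; 
refit=True is the default, I am only spelling it out):

from sklearn.datasets import load_digits
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVC

digits = load_digits()
X, y = digits.data, digits.target

# tune the selected algorithm once more via "regular" cross-validation,
# this time on all available data
gs = GridSearchCV(SVC(kernel='rbf'),
                  {'C': [1, 10, 100], 'gamma': [0.001, 0.0001]},
                  cv=10, refit=True)
gs.fit(X, y)

# with refit=True, the best parameter combination is retrained on the
# full data set, so this is the model you would actually use
final_model = gs.best_estimator_

That also touches on your last question: best_estimator_ only exists after you 
call fit() on the (Randomized/Grid)SearchCV object itself, so this final fit on 
the full data set is where you would take it from.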


Best,
Sebastian

> On Sep 27, 2015, at 4:20 PM, Philip Tully <tu...@csc.kth.se> wrote:
> 
> Hi all,
> 
> My question is mostly technical, but part ML best practice. I am
> performing (Randomized/)GridSearchCV to 'optimize' the hyperparameters
> of my estimator. However, if I want to do model selection after this,
> it would be best to do nested cross-validation to get a more unbiased
> estimate and avoid issues like overoptimistic score reporting as
> discussed in these papers:
> 
> 1) G. C. Cawley and N. L. C. Talbot, "Over-fitting in model selection
> and subsequent selection bias in performance evaluation," Journal of
> Machine Learning Research, vol. 11, pp. 2079-2107, July 2010.
> 2) Varma, Sudhir, and Richard Simon. "Bias in error estimation when
> using cross-validation for model selection." BMC bioinformatics 7.1
> (2006): 91.
> 
> Luckily, sklearn allows me to do this via cross_val_score, as
> described here:
> http://scikit-learn.org/stable/tutorial/statistical_inference/model_selection.html
> 
> But the documentation is a little thin and I want to make sure that I
> am doing this correctly. Here is simple running code that does this
> straightaway (afaict):
> ______________________________________________
> import numpy as np
> import sklearn
> from sklearn.grid_search import RandomizedSearchCV
> from sklearn.datasets import load_digits
> from sklearn.cross_validation import cross_val_score
> from sklearn.svm import SVC
> from sklearn.preprocessing import StandardScaler
> from sklearn.pipeline import Pipeline
> 
> # get some data
> iris = load_digits()
> X, y = iris.data, iris.target
> 
> param_dist = {
>          'rbf_svm__C': [1, 10, 100, 1000],
>          'rbf_svm__gamma': [0.001, 0.0001],
>          'rbf_svm__kernel': ['rbf', 'linear'],
> }
> 
> steps = [('scaler', StandardScaler()), ('rbf_svm', SVC())]
> pipeline = Pipeline(steps)
> 
> search = RandomizedSearchCV(pipeline,
> param_distributions=param_dist, n_iter=5)
> 
> cross_val_score(search, X, y)
> ______________________________________________
> 
> Now this is all well and good, HOWEVER, when I want to be more
> specific about what kind of cross validation procedures I want to run,
> I can set cv=sklearn.cross_validation.KFold(len(X), 10) and pass this
> both to RandomizedSearchCV AND cross_val_score.
> 
> But if I do this, I often get errors that look like this:
> 
> /Library/Python/2.7/site-packages/sklearn/utils/__init__.pyc in
> safe_indexing(X, indices)
>    155                                    indices.dtype.kind == 'i'):
>    156             # This is often substantially faster than X[indices]
> --> 157             return X.take(indices, axis=0)
>    158         else:
>    159             return X[indices]
> 
> IndexError: index 1617 is out of bounds for size 1617
> 
> This makes sense to me after thinking about it actually, because the
> first argument in KFold should be different between the inner CV and
> outer CV when they are nested. For example, if I split my data into
> k=10 folds in the outer CV, then the inner CV should use training data
> that is the size of only 9 of the outer CV folds. Is this logical?
> 
> It turns out if I assume this and test the boundary conditions for
> 9/10 of the original training data, my hypothesis seems correct and
> the nested cv runs like a charm. You can test it yourself if you set
> the cv argument of RandomizedSearchCV and cross_val_score to,
> respectively above:
> cv=sklearn.cross_validation.KFold(min([len(a) for a,b in
> sklearn.cross_validation.KFold(len(X), 10)]), 10)
> cv=sklearn.cross_validation.KFold(len(X), 10)
> 
> Note that the inner CV is based on the lowest number of elements in a
> fold to do CV over in the case where it is not evenly divisible by
> k=10. This probably leaves out a few data points but it is the best I
> can do without crashing the program with the above error message
> (since it seems the 'n' arg in KFold cannot be dynamically set).
> 
> This seems messy, and may not be the best way to go about doing this.
> My question is, is there a better way of accomplishing this if I want
> to do nested 10-fold cross validation using cross_val_score with a
> RandomizedSearchCV pipeline? Another question is that after I get the
> relevant unbiased scores to report, if I want to get the best
> classifier would I then have to go back and fit my full dataset using
> the second KFold object in the initialization of RandomizedSearchCV?
> best_estimator_ is only available after I fit the RandomizedSearchCV
> it seems, even if I have already called cross_val_score...
> 
> kind regards,
> Philip
> 

