Sebastian,

Many thanks - your suggestion matches my intuition, and this is how I'll proceed from here!
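Concretely, here is roughly what I intend to run -- a sketch only, combining your separate inner/outer cv objects with the min() sizing from my original mail, and not yet re-tested end-to-end in exactly this form:
______________________________________________
from sklearn.cross_validation import KFold, cross_val_score
from sklearn.datasets import load_digits
from sklearn.grid_search import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

digits = load_digits()
X, y = digits.data, digits.target

param_dist = {
    'rbf_svm__C': [1, 10, 100, 1000],
    'rbf_svm__gamma': [0.001, 0.0001],
    'rbf_svm__kernel': ['rbf', 'linear'],
}
pipeline = Pipeline([('scaler', StandardScaler()), ('rbf_svm', SVC())])

# outer CV over the full data set
outer_cv = KFold(len(X), 10)

# inner CV sized to the smallest outer training fold, so it never
# indexes past the subset it actually receives
n_inner = min(len(train) for train, test in KFold(len(X), 10))
inner_cv = KFold(n_inner, 10)

search = RandomizedSearchCV(pipeline, param_distributions=param_dist,
                            n_iter=5, cv=inner_cv)

# unbiased performance estimate to report, from the outer loop
scores = cross_val_score(search, X, y, cv=outer_cv)

# final classifier: tune once more on the full data set and keep
# best_estimator_ (refit=True by default, so the best parameter setting
# is retrained on all of X)
final_search = RandomizedSearchCV(pipeline, param_distributions=param_dist,
                                  n_iter=5, cv=KFold(len(X), 10))
final_search.fit(X, y)
classifier = final_search.best_estimator_
______________________________________________
If there turns out to be a cleaner way to size the inner KFold, I'm of course still interested.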
best wishes,
Philip

On Monday, September 28, 2015, Sebastian Raschka <se.rasc...@gmail.com> wrote:
> Hi, Philip,
>
> > (Randomized/)GridSearchCV to 'optimize' the hyperparameters
> > of my estimator. However, if I want to do model selection after this,
>
> Essentially, the hyperparameter tuning is already your model selection step, since you couple the (Randomized/)GridSearchCV with some performance metric. So, let's say that via GridSearch you find that the inverse regularization parameter C=0.1 and an RBF kernel width of gamma=100 give you the best score, e.g., ROC AUC. If you now "use" those C and gamma values, you have effectively selected your model, which you can then further evaluate on an independent test set (if you have kept one).
>
> Now, if you are interested in comparing different learning algorithms, e.g., tree-based methods, linear models, and kernel SVMs, then I'd definitely recommend using nested cross-validation, for example, as you already did:
>
> > search = RandomizedSearchCV(pipeline,
> >     param_distributions=param_dist, n_iter=5)
> >
> > cross_val_score(search, X, y)
>
> I am not sure why you encounter this error in your second example -- I'd have to think about it more -- but I suspect it comes from reusing the same cv object for both loops. Maybe try to initialize 2 separate cross-validation objects instead of passing
>
> > cv=sklearn.cross_validation.KFold(len(X), 10)
>
> to both. Let's say you have 100 training points. In the outer loop, you split them into 10 folds and pass 9 folds on to the inner loop. So your inner loop effectively has only 90 training samples, which is why "len(X)" in
>
> > cv=sklearn.cross_validation.KFold(len(X), 10)
>
> is not correct for the inner loop anymore. Maybe try
>
> cv_inner=sklearn.cross_validation.KFold(len(X) - len(X)/10, 10)
>
> Reading further down your email, it sounds like this is what you have done and it worked?
>
> > Another question is that after I get the
> > relevant unbiased scores to report, if I want to get the best
> > classifier would I then have to go back and fit my full dataset using
> > the second KFold object in the initialization of RandomizedSearchCV?
>
> So, if you do nested cross-validation, a few things can happen... For example, let's say you tuned and evaluated an RBF kernel SVM with respect to C and gamma. For simplicity, let's talk about a 100-sample training set with 10 inner and 10 outer folds. In the inner loop, you tune your model via GridSearch & cross-validation on the 90 training samples. Let's say you find that gamma=0.1 and C=10 work "best". Next, this model is evaluated on the 10 remaining validation samples of your outer loop. You keep this validation score and advance to the next outer-loop fold. Again, you pass 90 samples -- these are different now -- to the inner loop. If your model is stable, you may find that gamma=0.1 and C=10 also give you the "best" inner CV results. Then you evaluate this model, tuned in the inner fold, on the hold-out data (now also different) of the outer loop. If your model is unstable, though, you may get different values for gamma and C in the inner loop, for example gamma=1.0 and C=100. After you have repeated this 10 times, you have 10 validation scores from the outer loop that you can average to get a (relatively) unbiased estimate of your model's performance. However, you may also have different models associated with each validation score.
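> In code, the procedure I am describing would look roughly like this (rough, untested sketch; pipeline, param_dist, X, y as in your example further down):
>
> from sklearn.cross_validation import KFold
> from sklearn.grid_search import GridSearchCV
>
> outer_cv = KFold(len(X), 10)
> outer_scores, best_params_per_fold = [], []
>
> for train_idx, test_idx in outer_cv:
>     # inner loop: tune on the ~90% training portion only;
>     # an integer cv builds its folds on whatever subset it receives
>     inner_search = GridSearchCV(pipeline, param_grid=param_dist, cv=10)
>     inner_search.fit(X[train_idx], y[train_idx])
>
>     # outer loop: evaluate the tuned model on the held-out ~10%
>     outer_scores.append(inner_search.score(X[test_idx], y[test_idx]))
>     best_params_per_fold.append(inner_search.best_params_)
>
> # the average of outer_scores is the (relatively) unbiased estimate;
> # best_params_per_fold may differ from fold to fold if the model is unstable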
> In practice, you would repeat this nested CV for different algorithms you'd like to compare and select the model & algorithm that gives you the "best" unbiased estimate (the average of the outer-loop validation scores). After that, you select this "best" learning algorithm and tune it again via "regular" cross-validation to find good hyperparameters. If you want to use your algorithm for some sort of real-world application, you may also want to train it (without further tuning) on all your available data after all the evaluation is done.
>
> Best,
> Sebastian
>
>
> > On Sep 27, 2015, at 4:20 PM, Philip Tully <tu...@csc.kth.se> wrote:
> >
> > Hi all,
> >
> > My question is mostly technical, but partly about ML best practice. I am performing (Randomized/)GridSearchCV to 'optimize' the hyperparameters of my estimator. However, if I want to do model selection after this, it would be best to do nested cross-validation to get a less biased estimate and avoid issues like overoptimistic score reporting, as discussed in these papers:
> >
> > 1) G. C. Cawley and N. L. C. Talbot, "Over-fitting in model selection and subsequent selection bias in performance evaluation," Journal of Machine Learning Research, vol. 11, pp. 2079-2107, July 2010.
> > 2) S. Varma and R. Simon, "Bias in error estimation when using cross-validation for model selection," BMC Bioinformatics, vol. 7, no. 1, p. 91, 2006.
> >
> > Luckily, sklearn allows me to do this via cross_val_score, as described here:
> > http://scikit-learn.org/stable/tutorial/statistical_inference/model_selection.html
> >
> > But the documentation is a little thin and I want to make sure that I am doing this correctly. Here is simple running code that does this straightaway (afaict):
> > ______________________________________________
> > import numpy as np
> > import sklearn
> > from sklearn.grid_search import RandomizedSearchCV
> > from sklearn.datasets import load_digits
> > from sklearn.cross_validation import cross_val_score
> > from sklearn.svm import SVC
> > from sklearn.preprocessing import StandardScaler
> > from sklearn.pipeline import Pipeline
> >
> > # get some data
> > digits = load_digits()
> > X, y = digits.data, digits.target
> >
> > param_dist = {
> >     'rbf_svm__C': [1, 10, 100, 1000],
> >     'rbf_svm__gamma': [0.001, 0.0001],
> >     'rbf_svm__kernel': ['rbf', 'linear'],
> > }
> >
> > steps = [('scaler', StandardScaler()), ('rbf_svm', SVC())]
> > pipeline = Pipeline(steps)
> >
> > search = RandomizedSearchCV(pipeline,
> >     param_distributions=param_dist, n_iter=5)
> >
> > cross_val_score(search, X, y)
> > ______________________________________________
> >
> > Now this is all well and good. HOWEVER, when I want to be more specific about what kind of cross-validation procedure I want to run, I can set cv=sklearn.cross_validation.KFold(len(X), 10) and pass this both to RandomizedSearchCV AND cross_val_score.
> > But if I do this, I often get errors that look like this:
> >
> > /Library/Python/2.7/site-packages/sklearn/utils/__init__.pyc in safe_indexing(X, indices)
> >     155                 indices.dtype.kind == 'i'):
> >     156             # This is often substantially faster than X[indices]
> > --> 157             return X.take(indices, axis=0)
> >     158         else:
> >     159             return X[indices]
> >
> > IndexError: index 1617 is out of bounds for size 1617
> >
> > This actually makes sense to me after thinking about it, because the first argument to KFold should be different between the inner CV and the outer CV when they are nested. For example, if I split my data into k=10 folds in the outer CV, then the inner CV should use training data that is only the size of 9 of the outer CV folds. Is this logical?
> >
> > It turns out that if I assume this and test the boundary conditions for 9/10 of the original training data, my hypothesis seems correct and the nested CV runs like a charm. You can test it yourself if you set the cv arguments of RandomizedSearchCV and cross_val_score above to, respectively:
> >
> > cv=sklearn.cross_validation.KFold(min([len(a) for a, b in sklearn.cross_validation.KFold(len(X), 10)]), 10)
> > cv=sklearn.cross_validation.KFold(len(X), 10)
> >
> > Note that the inner CV is sized to the smallest training fold, to handle the case where len(X) is not evenly divisible by k=10. This probably leaves out a few data points, but it is the best I can do without crashing the program with the above error message (since it seems the 'n' argument of KFold cannot be set dynamically).
> >
> > This seems messy, and may not be the best way to go about doing this. My question is: is there a better way of accomplishing this if I want to do nested 10-fold cross-validation using cross_val_score with a RandomizedSearchCV pipeline? Another question: after I get the relevant unbiased scores to report, if I want to get the best classifier, would I then have to go back and fit my full dataset using the second KFold object in the initialization of RandomizedSearchCV? It seems best_estimator_ is only available after I fit the RandomizedSearchCV, even if I have already called cross_val_score...
> >
> > kind regards,
> > Philip

--
Sent from my iPhone