FWIW, we are currently reviewing a redeveloped cross-validation module
that handles this exact issue. The number of samples will no longer be
passed to the new KFold constructor. Instead you would
use cv = KFold(10), then RandomizedSearchCV would call cv.split(X, y) so
that KFold's split method can use statistics from the passed data, not from
constructor parameters. This should be available not in the upcoming
release, but in the one after that.
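
To make that concrete, here is a rough sketch of how the new usage is
expected to look (untested; the module path sklearn.model_selection and the
exact signatures below are my best guess and may change before release):

import numpy as np
from sklearn.model_selection import KFold    # tentative new location

X = np.arange(20).reshape(10, 2)             # toy data just for illustration
y = np.arange(10)

cv = KFold(10)                               # only the number of folds is fixed here
for train_idx, test_idx in cv.split(X, y):   # fold sizes come from X itself
    print(len(train_idx), len(test_idx))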

On 29 September 2015 at 05:06, Philip Tully <tu...@csc.kth.se> wrote:

> Sebastian,
>
> Many thanks - your suggestion matches my intuition, and this is how I'll
> proceed from here!
>
> best wishes,
> Philip
>
>
> On Monday, September 28, 2015, Sebastian Raschka <se.rasc...@gmail.com>
> wrote:
>
>> Hi, Philip,
>>
>> >  (Randomized/)GridSearchCV to 'optimize' the hyperparameters
>> > of my estimator. However, if I want to do model selection after this,
>>
>> Essentially, the hyperparameter tuning is already your model selection
>> step since you couple the (Randomized/)GridSearchCV with some performance
>> metric. So, let's say via GridSearch, you find that inverse regularization
>> param C=0.1 and an RBF kernel parameter gamma=100 give you the best, e.g.,
>> ROC AUC. If you now "use" those C and gamma values, you have effectively
>> selected your model, which you can then further evaluate on an independent
>> test set (if you have kept one).
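>>
>> For illustration, a minimal sketch of that workflow (the split size, the
>> parameter grid, and the dataset are arbitrary placeholders here, and I am
>> using plain accuracy instead of ROC AUC so it runs on a multiclass dataset):
>>
>> from sklearn.cross_validation import train_test_split
>> from sklearn.datasets import load_digits
>> from sklearn.grid_search import GridSearchCV
>> from sklearn.svm import SVC
>>
>> digits = load_digits()
>> # keep an independent test set aside before any tuning
>> X_train, X_test, y_train, y_test = train_test_split(
>>     digits.data, digits.target, test_size=0.3, random_state=0)
>>
>> # hyperparameter tuning doubles as model selection here
>> grid = GridSearchCV(SVC(), {'C': [0.1, 1, 10], 'gamma': [0.001, 0.01]},
>>                     cv=10)
>> grid.fit(X_train, y_train)
>> print(grid.best_params_)           # the selected C / gamma
>> print(grid.score(X_test, y_test))  # one final check on the held-out set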
>>
>> Now, if you are interested in comparing different learning algorithms,
>> e.g., tree-based methods, linear models, kernel SVM, then I'd definitely
>> recommend using nested cross-validation, for example as you already did:
>>
>> > search = RandomizedSearchCV(pipeline,
>> > param_distributions=param_dist, n_iter=5)
>> >
>> > cross_val_score(search, X, y)
>>
>> I am not sure why you encounter this error in your second example; I'd
>> have to think about it more, but I suspect it comes from the fold sizes.
>>
>> Maybe try initializing 2 separate cross-validation objects, for example
>>
>> > cv=sklearn.cross_validation.KFold(len(X), 10)
>> >
>>
>>
>> Let's say you have 100 training points. In the outer loop, you split them
>> into 10 folds and then pass 9 folds to the inner loop. So the inner loop
>> effectively sees only 90 training samples, which is why, for the inner loop,
>> "len(X)" in
>>
>> > cv=sklearn.cross_validation.KFold(len(X), 10)
>>
>>
>> is no longer correct. Maybe try
>>
>> > cv_inner=sklearn.cross_validation.KFold(len(X) - len(X)/10, 10)
>>
>>
>> Reading further down your email, it sounds like this is what you have
>> done and it worked?
>>
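>> In full, what I have in mind is roughly the following sketch (untested; it
>> reuses the pipeline, param_dist, X and y from your code below, and sizes the
>> inner KFold to the smallest outer training split so that it also works when
>> len(X) is not divisible by 10):
>>
>> import sklearn.cross_validation as cval
>> from sklearn.grid_search import RandomizedSearchCV
>>
>> # outer CV over the full data set
>> cv_outer = cval.KFold(len(X), 10)
>> # inner CV sized so its indices never run past the data handed to it;
>> # a few samples of the larger outer training splits are simply left unused
>> n_inner = min(len(train) for train, test in cv_outer)
>> cv_inner = cval.KFold(n_inner, 10)
>>
>> search = RandomizedSearchCV(pipeline, param_distributions=param_dist,
>>                             n_iter=5, cv=cv_inner)
>> scores = cval.cross_val_score(search, X, y, cv=cv_outer)
>>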
>> >  Another question is that after I get the
>> > relevant unbiased scores to report, if I want to get the best
>> > classifier would I then have to go back and fit my full dataset using
>> > the second KFold object in the initialization of RandomizedSearchCV?
>>
>> So, if you do nested cross-validation, a few things can happen... For
>> example, let's say you tuned and evaluated an RBF kernel SVM with respect
>> to C and gamma. For simplicity, let's talk about a 100 sample training set
>> with 10 inner and 10 outer folds. In the inner loop, you tune your model
>> via GridSearch & cross-validation on the 90 training samples. Let's say you
>> find that a model with gamma=0.1 and C=10 works "best". Next, this model is
>> evaluated on the 10 remaining validation samples of your outer loop. You
>> keep this validation score and advance to the next outer loop fold. Again,
>> you pass 90 samples (these are different now) to the inner loop. If your
>> model is stable, you may find that gamma=0.1 and C=10 again give you the
>> "best" inner CV results. Then you evaluate this model, tuned in the inner
>> loop, on the hold-out data (now also different) of the outer loop. If your
>> model is unstable, you may get different values for gamma and C in the
>> inner loop, for example gamma=1.0 and C=100. After you have repeated this
>> 10 times, you have 10 validation scores from the outer loop that you can
>> average to get a (relatively) unbiased estimate of your model's
>> performance. However, you may also have different models associated with
>> each validation score.
>>
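>> In code, the procedure above looks roughly like this (a sketch only; the
>> parameter grid and fold counts are placeholders, and X, y are assumed to be
>> numpy arrays):
>>
>> import numpy as np
>> from sklearn.cross_validation import KFold
>> from sklearn.grid_search import GridSearchCV
>> from sklearn.svm import SVC
>>
>> outer_scores, chosen_params = [], []
>> for train_idx, test_idx in KFold(len(X), 10):           # outer loop
>>     # inner loop: tune C and gamma on the 9 training folds only
>>     inner = GridSearchCV(SVC(kernel='rbf'),
>>                          {'C': [1, 10, 100], 'gamma': [0.01, 0.1, 1.0]},
>>                          cv=10)
>>     inner.fit(X[train_idx], y[train_idx])
>>     chosen_params.append(inner.best_params_)            # may differ per fold
>>     # evaluate the tuned model on the held-out outer fold
>>     outer_scores.append(inner.score(X[test_idx], y[test_idx]))
>>
>> print(np.mean(outer_scores))   # (relatively) unbiased performance estimate
>>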
>> In practice, you would repeat this nested CV for different algorithms
>> you'd like to compare and select the model & algorithm that gives you the
>> "best" unbiased estimate (average of the outer loop validation scores).
>> After that, you select this "best" learning algorithm and tune it again via
>> "regular" cross validation to find good hyperparameters. If you want to use
>> you algorithm for some sort of real-world application, you maybe also want
>> to train it (without further tuning) on all your available data after all
>> the evaluation is done.
>>
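>> That last step could look like this (again just a sketch, with a placeholder
>> grid):
>>
>> from sklearn.grid_search import GridSearchCV
>> from sklearn.svm import SVC
>>
>> final_search = GridSearchCV(SVC(kernel='rbf'),
>>                             {'C': [1, 10, 100], 'gamma': [0.01, 0.1, 1.0]},
>>                             cv=10)
>> final_search.fit(X, y)                      # "regular" CV over all the data
>> final_model = final_search.best_estimator_  # refit on all of X, y by default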
>>
>> Best,
>> Sebastian
>>
>> > On Sep 27, 2015, at 4:20 PM, Philip Tully <tu...@csc.kth.se> wrote:
>> >
>> > Hi all,
>> >
>> > My question is mostly technical, but partly about ML best practice. I am
>> > performing (Randomized/)GridSearchCV to 'optimize' the hyperparameters
>> > of my estimator. However, if I want to do model selection after this,
>> > it would be best to do nested cross-validation to get a more unbiased
>> > estimate and avoid issues like overoptimistic score reporting as
>> > discussed in these papers:
>> >
>> > 1) Cawley, G. C., and Talbot, N. L. C. "On over-fitting in model
>> > selection and subsequent selection bias in performance evaluation."
>> > Journal of Machine Learning Research 11 (2010): 2079-2107.
>> > 2) Varma, Sudhir, and Richard Simon. "Bias in error estimation when
>> > using cross-validation for model selection." BMC Bioinformatics 7.1
>> > (2006): 91.
>> >
>> > Luckily, sklearn allows me to do this via cross_val_score, as
>> > described here:
>> >
>> http://scikit-learn.org/stable/tutorial/statistical_inference/model_selection.html
>> >
>> > But the documentation is a little thin and I want to make sure that I
>> > am doing this correctly. Here is a simple running example that does this
>> > straightaway (afaict):
>> > ______________________________________________
>> > import numpy as np
>> > import sklearn
>> > from sklearn.grid_search import RandomizedSearchCV
>> > from sklearn.datasets import load_digits
>> > from sklearn.cross_validation import cross_val_score
>> > from sklearn.svm import SVC
>> > from sklearn.preprocessing import StandardScaler
>> > from sklearn.pipeline import Pipeline
>> >
>> > # get some data
>> > digits = load_digits()
>> > X, y = digits.data, digits.target
>> >
>> > param_dist = {
>> >          'rbf_svm__C': [1, 10, 100, 1000],
>> >          'rbf_svm__gamma': [0.001, 0.0001],
>> >          'rbf_svm__kernel': ['rbf', 'linear'],
>> > }
>> >
>> > steps = [('scaler', StandardScaler()), ('rbf_svm', SVC())]
>> > pipeline = Pipeline(steps)
>> >
>> > search = RandomizedSearchCV(pipeline,
>> > param_distributions=param_dist, n_iter=5)
>> >
>> > cross_val_score(search, X, y)
>> > ______________________________________________
>> >
>> > Now this is all well and good, HOWEVER, when I want to be more
>> > specific about what kind of cross validation procedures I want to run,
>> > I can set cv=sklearn.cross_validation.KFold(len(X), 10) and pass this
>> > both to RandomizedSearchCV AND cross_val_score.
>> >
>> > But if I do this, I often get errors that look like this:
>> >
>> > /Library/Python/2.7/site-packages/sklearn/utils/__init__.pyc in
>> > safe_indexing(X, indices)
>> >    155                                    indices.dtype.kind == 'i'):
>> >    156             # This is often substantially faster than X[indices]
>> > --> 157             return X.take(indices, axis=0)
>> >    158         else:
>> >    159             return X[indices]
>> >
>> > IndexError: index 1617 is out of bounds for size 1617
>> >
>> > This makes sense to me after thinking about it actually, because the
>> > first argument in KFold should be different between the inner CV and
>> > outer CV when they are nested. For example, if I split my data into
>> > k=10 folds in the outer CV, then the inner CV should use training data
>> > that is the size of only 9 of the outer CV folds. Is this logical?
>> >
>> > It turns out if I assume this and test the boundary conditions for
>> > 9/10 of the original training data, my hypothesis seems correct and
>> > the nested CV runs like a charm. You can test it yourself if you set
>> > the cv arguments of RandomizedSearchCV and cross_val_score above to,
>> > respectively:
>> > cv=sklearn.cross_validation.KFold(min([len(a) for a, b in
>> > sklearn.cross_validation.KFold(len(X), 10)]), 10)
>> > cv=sklearn.cross_validation.KFold(len(X), 10)
>> >
>> > Note that the inner CV is sized to the smallest outer training split,
>> > to handle the case where len(X) is not evenly divisible by
>> > k=10. This probably leaves out a few data points, but it is the best I
>> > can do without crashing the program with the above error message
>> > (since it seems the 'n' arg in KFold cannot be dynamically set).
>> >
>> > This seems messy, and may not be the best way to go about doing this.
>> > My question is, is there a better way of accomplishing this if I want
>> > to do nested 10-fold cross validation using cross_val_score with a
>> > RandomizedSearchCV pipeline? Another question is that after I get the
>> > relevant unbiased scores to report, if I want to get the best
>> > classifier would I then have to go back and fit my full dataset using
>> > the second KFold object in the initialization of RandomizedSearchCV?
>> > best_estimator_ only seems to be available after I fit the
>> > RandomizedSearchCV, even if I have already called cross_val_score...
>> >
>> > kind regards,
>> > Philip
>> >
>> >
>>
>>
>>
>>
>
>
> --
> Sent from my iPhone
>
>
>
>
------------------------------------------------------------------------------
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
