Hey there! A common goal in machine learning is to estimate a model's performance in addition to training it, and this is often done via cross-validation. To also tune hyperparameters, one might want to nest one cross-validation loop inside another, which the sklearn framework makes very easy. However, sometimes it is necessary to stratify the folds to enforce certain constraints (e.g., roughly the same proportion of each target label in every fold). Such splitters are also provided (e.g., StratifiedShuffleSplit), but they do not work when they are nested:
import numpy as np
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import StratifiedShuffleSplit
from sklearn.cross_validation import cross_val_score
from sklearn.linear_model import LogisticRegression

# Number of samples per component
n_samples = 1000

# Generate random sample, two classes
X = np.r_[
    np.dot(np.random.randn(n_samples, 2), np.array([[0., -0.1], [1.7, .4]])),
    np.dot(np.random.randn(n_samples, 2), np.array([[1.0, 0.0], [0.0, 1.0]])) + np.array([-2, 2])
]
y = np.concatenate([np.ones(n_samples), -np.ones(n_samples)])

# Fit model: grid-search C with a stratified inner split
LogRegOptimalC = GridSearchCV(
    estimator=LogisticRegression(),
    cv=StratifiedShuffleSplit(y, 3, test_size=0.5, random_state=0),
    param_grid={'C': np.logspace(-3, 3, 7)}
)

print cross_val_score(LogRegOptimalC, X, y, cv=5).mean()

The problem seems to be that the array reflecting the splitting criterion (here the target y) is not split for the inner folds. Is there some way to tackle this, or are there already initiatives dealing with it?

Thanks,
Christoph
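
P.S. Here is a minimal check of what I mean (it just reuses y and the imports from the snippet above; the variable names are only for illustration). The shuffle split is built from the full 2000-sample y, so the indices it yields refer to the full array, while the inner GridSearchCV only ever receives the roughly 1600 training samples of an outer fold, so those indices no longer line up:

inner_cv = StratifiedShuffleSplit(y, 3, test_size=0.5, random_state=0)
for train_idx, test_idx in inner_cv:
    # these indices index into the full 2000-sample array,
    # not into the ~1600-sample training portion that the
    # outer 5-fold loop passes on to GridSearchCV
    print train_idx.max(), test_idx.max()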