Thanks for the response. I am actually interested in the new DisjointLabelKFold (https://github.com/scikit-learn/scikit-learn/pull/4444), which depends on an additional label array. This use case does not seem to be covered yet in the new sklearn.model_selection, does it?
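Concretely, I mean something like the following. This is only a minimal sketch, assuming the LabelKFold-style API from that PR as it appears in sklearn.cross_validation, where the splitter takes the label array at construction time; that construction-time binding is exactly what makes nesting it inside GridSearchCV problematic, as in my original example below:

import numpy as np
from sklearn.cross_validation import LabelKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

# Ten "subjects" with ten samples each; samples sharing a label
# must never end up in both the training and the test fold.
labels = np.repeat(np.arange(10), 10)
X = np.random.randn(100, 2)
y = np.tile([0, 1], 50)

# Like the old splitters, LabelKFold receives the labels up front.
cv = LabelKFold(labels, n_folds=5)
print(cross_val_score(LogisticRegression(), X, y, cv=cv).mean())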
> Changes to support this case have recently been merged into master, and an
> example is on its way:
> https://github.com/scikit-learn/scikit-learn/issues/5589
>
> I think you should be able to run your code by importing GridSearchCV,
> cross_val_score and StratifiedShuffleSplit from the new
> sklearn.model_selection; the code is then identical except that you drop
> the `y` argument from StratifiedShuffleSplit's constructor (it's a
> different class, actually).
>
> Please do try it out!
>
> On 29 October 2015 at 05:00, Christoph Sawade <
> christoph.saw...@googlemail.com> wrote:
>
>> Hey there!
>>
>> A common goal in machine learning, when training a model, is also to
>> estimate its performance. This is often done via cross validation. In
>> order to also tune hyperparameters, one might want to nest one
>> cross-validation loop inside another. The sklearn framework makes that
>> very easy. However, sometimes it is necessary to stratify the folds to
>> ensure certain constraints (e.g., roughly equal proportions of the target
>> label in each fold). Such splitters are provided (e.g.,
>> StratifiedShuffleSplit) but do not work when they are nested:
>>
>> import numpy as np
>> from sklearn.grid_search import GridSearchCV
>> from sklearn.cross_validation import StratifiedShuffleSplit, cross_val_score
>> from sklearn.linear_model import LogisticRegression
>>
>> # Number of samples per component
>> n_samples = 1000
>>
>> # Generate a random sample, two classes
>> X = np.r_[
>>     np.dot(np.random.randn(n_samples, 2), np.array([[0., -0.1], [1.7, .4]])),
>>     np.dot(np.random.randn(n_samples, 2), np.array([[1.0, 0.0], [0.0, 1.0]])) + np.array([-2, 2])
>> ]
>> y = np.concatenate([np.ones(n_samples), -np.ones(n_samples)])
>>
>> # Fit model
>> LogRegOptimalC = GridSearchCV(
>>     estimator=LogisticRegression(),
>>     cv=StratifiedShuffleSplit(y, 3, test_size=0.5, random_state=0),
>>     param_grid={'C': np.logspace(-3, 3, 7)}
>> )
>> print(cross_val_score(LogRegOptimalC, X, y, cv=5).mean())
>>
>> The problem seems to be that the array reflecting the splitting criterion
>> (here the target y) is not split for the inner folds. Is there some way
>> to tackle that, or are there already initiatives dealing with it?
>>
>> Thx Christoph
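For reference, ported along the lines you suggest, I would expect the quoted example to look like this. It is a sketch against current master, so parameter names may still differ; the key point is that the new splitter no longer takes y in its constructor but receives it via split(), so the inner folds see the correct inner labels:

import numpy as np
from sklearn.model_selection import (GridSearchCV, cross_val_score,
                                     StratifiedShuffleSplit)
from sklearn.linear_model import LogisticRegression

# Number of samples per component
n_samples = 1000

# Generate a random sample, two classes
X = np.r_[
    np.dot(np.random.randn(n_samples, 2), np.array([[0., -0.1], [1.7, .4]])),
    np.dot(np.random.randn(n_samples, 2), np.array([[1.0, 0.0], [0.0, 1.0]])) + np.array([-2, 2])
]
y = np.concatenate([np.ones(n_samples), -np.ones(n_samples)])

# Fit model; note the dropped y argument in the splitter's constructor.
LogRegOptimalC = GridSearchCV(
    estimator=LogisticRegression(),
    cv=StratifiedShuffleSplit(3, test_size=0.5, random_state=0),
    param_grid={'C': np.logspace(-3, 3, 7)}
)
print(cross_val_score(LogRegOptimalC, X, y, cv=5).mean())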