Hey there!

A common task in machine learning is to estimate a model's performance while
training it. This is typically done via cross-validation. To also tune
hyperparameters, one may want to nest one cross-validation loop inside
another. The scikit-learn framework makes that very easy. However, sometimes
it is necessary to stratify the folds to satisfy certain constraints (e.g.,
roughly the same proportion of each target label in every fold). Such
splitters are provided as well (e.g., StratifiedShuffleSplit), but they do
not work when nested:

import numpy as np
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import StratifiedShuffleSplit, cross_val_score
from sklearn.linear_model import LogisticRegression

# Number of samples per component
n_samples = 1000

# Generate random sample, two classes
X = np.r_[
    np.dot(np.random.randn(n_samples, 2), np.array([[0.0, -0.1], [1.7, 0.4]])),
    np.dot(np.random.randn(n_samples, 2), np.array([[1.0, 0.0], [0.0, 1.0]])) + np.array([-2, 2])
]
y = np.concatenate([np.ones(n_samples), -np.ones(n_samples)])

# Inner loop: grid-search C using stratified shuffle splits of the full y
LogRegOptimalC = GridSearchCV(
    estimator=LogisticRegression(),
    cv=StratifiedShuffleSplit(y, 3, test_size=0.5, random_state=0),
    param_grid={
        'C': np.logspace(-3, 3, 7)
    }
)
# Outer loop: 5-fold CV estimate of the tuned model's performance
print(cross_val_score(LogRegOptimalC, X, y, cv=5).mean())

The problem seems to be that the array defining the splitting criterion
(here the target y) is not split along with the data for the inner folds.
Is there some way to tackle that, or are there already initiatives dealing
with it?
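
To make the point concrete, here is an untested sketch of the manual version
I have in mind, where the outer loop is written by hand so that the inner
StratifiedShuffleSplit is rebuilt from each training fold's labels
(StratifiedKFold for the outer split and the names outer_cv / inner_cv are
just my choices):

from sklearn.cross_validation import StratifiedKFold

# Outer loop written by hand: the inner splitter only ever sees the
# labels of the current training fold.
outer_cv = StratifiedKFold(y, n_folds=5)
outer_scores = []
for train, test in outer_cv:
    # Rebuild the inner CV from the training labels of this fold
    inner_cv = StratifiedShuffleSplit(y[train], 3, test_size=0.5,
                                      random_state=0)
    clf = GridSearchCV(
        estimator=LogisticRegression(),
        cv=inner_cv,
        param_grid={'C': np.logspace(-3, 3, 7)}
    )
    clf.fit(X[train], y[train])
    outer_scores.append(clf.score(X[test], y[test]))
print(np.mean(outer_scores))

This works, but it is clumsy compared to just passing the nested estimator
to cross_val_score.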

Thx Christoph