Thanks for the response. I am actually interested in the new DisjointLabelKFold (https://github.com/scikit-learn/scikit-learn/pull/4444), which depends on an additional label array. This use case does not seem to be covered yet in the new sklearn.model_selection, does it?
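Concretely, I mean something like the following. This is only a minimal sketch, assuming the LabelKFold-style API from that PR as it appears in sklearn.cross_validation, where the splitter takes the label array at construction time; that construction-time binding is exactly what makes nesting it inside GridSearchCV problematic, as in my original example below:

import numpy as np
from sklearn.cross_validation import LabelKFold, cross_val_score
from sklearn.linear_model import LogisticRegression

# Ten "subjects" with ten samples each; samples sharing a label
# must never end up in both the training and the test fold.
labels = np.repeat(np.arange(10), 10)
X = np.random.randn(100, 2)
y = np.tile([0, 1], 50)

# Like the old splitters, LabelKFold receives the labels up front.
cv = LabelKFold(labels, n_folds=5)
print(cross_val_score(LogisticRegression(), X, y, cv=cv).mean())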
> Changes to support this case have recently been merged into master, and an
> example is on its way:
> https://github.com/scikit-learn/scikit-learn/issues/5589
>
> I think you should be able to run your code by importing GridSearchCV,
> cross_val_score and StratifiedShuffleSplit from the new
> sklearn.model_selection; the code is then identical except that you drop
> the `y` argument from StratifiedShuffleSplit's constructor (it's a
> different class, actually).
>
> Please do try it out!
>
> On 29 October 2015 at 05:00, Christoph Sawade <
> christoph.saw...@googlemail.com> wrote:
>
>> Hey there!
>>
>> A common goal in machine learning, when training a model, is also to
>> estimate its performance. This is often done via cross validation. In
>> order to also tune hyperparameters, one might want to nest one
>> cross-validation loop inside another. The sklearn framework makes that
>> very easy. However, sometimes it is necessary to stratify the folds to
>> ensure certain constraints (e.g., roughly equal proportions of the target
>> label in each fold). Such splitters are provided (e.g.,
>> StratifiedShuffleSplit) but do not work when they are nested:
>>
>> import numpy as np
>> from sklearn.grid_search import GridSearchCV
>> from sklearn.cross_validation import StratifiedShuffleSplit, cross_val_score
>> from sklearn.linear_model import LogisticRegression
>>
>> # Number of samples per component
>> n_samples = 1000
>>
>> # Generate a random sample, two classes
>> X = np.r_[
>>     np.dot(np.random.randn(n_samples, 2), np.array([[0., -0.1], [1.7, .4]])),
>>     np.dot(np.random.randn(n_samples, 2), np.array([[1.0, 0.0], [0.0, 1.0]])) + np.array([-2, 2])
>> ]
>> y = np.concatenate([np.ones(n_samples), -np.ones(n_samples)])
>>
>> # Fit model
>> LogRegOptimalC = GridSearchCV(
>>     estimator=LogisticRegression(),
>>     cv=StratifiedShuffleSplit(y, 3, test_size=0.5, random_state=0),
>>     param_grid={'C': np.logspace(-3, 3, 7)}
>> )
>> print(cross_val_score(LogRegOptimalC, X, y, cv=5).mean())
>>
>> The problem seems to be that the array reflecting the splitting criterion
>> (here the target y) is not split for the inner folds. Is there some way
>> to tackle that, or are there already initiatives dealing with it?
>>
>> Thx Christoph
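For reference, ported along the lines you suggest, I would expect the quoted example to look like this. It is a sketch against current master, so parameter names may still differ; the key point is that the new splitter no longer takes y in its constructor but receives it via split(), so the inner folds see the correct inner labels:

import numpy as np
from sklearn.model_selection import (GridSearchCV, cross_val_score,
                                     StratifiedShuffleSplit)
from sklearn.linear_model import LogisticRegression

# Number of samples per component
n_samples = 1000

# Generate a random sample, two classes
X = np.r_[
    np.dot(np.random.randn(n_samples, 2), np.array([[0., -0.1], [1.7, .4]])),
    np.dot(np.random.randn(n_samples, 2), np.array([[1.0, 0.0], [0.0, 1.0]])) + np.array([-2, 2])
]
y = np.concatenate([np.ones(n_samples), -np.ones(n_samples)])

# Fit model; note the dropped y argument in the splitter's constructor.
LogRegOptimalC = GridSearchCV(
    estimator=LogisticRegression(),
    cv=StratifiedShuffleSplit(3, test_size=0.5, random_state=0),
    param_grid={'C': np.logspace(-3, 3, 7)}
)
print(cross_val_score(LogRegOptimalC, X, y, cv=5).mean())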