Dear Sebastian,
thank you for the quick answer.
The data is stored in a numpy array (shape: 68, 24). We are using scikit-learn 0.18.1.
I realise I wrote something wrong in my previous email. Your solution is indeed
correct if we let scikit-learn decide how to manage the inner loop. This is what
we did at the beginning. By doing so, we noticed that the classifier's
performance decreases (in comparison to a non-optimised classifier). We would
like to control the inner split, and we need to store the metrics for each fold.
The way we obtained the indices for the optimization, training and test phases
is equivalent to something like this:

from sklearn.model_selection import ShuffleSplit

rs = ShuffleSplit(n_splits=9, test_size=0.25, random_state=42)
indices_for_each_cv = list(rs.split(data[0:11]))
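For reference, each element of indices_for_each_cv is a (train_indices, test_indices)
pair of integer index arrays, which is also the format the cv argument of
GridSearchCV accepts. A minimal check, assuming data is the (68, 24) array
mentioned above:

train_idx, test_idx = indices_for_each_cv[0]  # two integer index arrays
print(len(train_idx), len(test_idx))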
Maybe I can make myself clearer if I describe what we would like to achieve for
the first cross-validation fold (I acknowledge that the previous email was
quite a mess, sorry). Outer loop: 48 subjects for training, 20 for testing. Of
the 48 training subjects, we would like to use 42 for optimization and 6 for
testing the parameters. We obtained the indices so that we match the different
scanners in the optimization phase as well, but we are not able to pass them to
the GridSearchCV object.
The following did not work; this is what we get: ValueError: too many values to unpack
ii = 0
while ii < len(cv_final):
    # fit and predict
    clf = GridSearchCV(pipeline,
                       param_grid=param_grid,
                       verbose=1,
                       cv=cv_final_nested[ii],  # how to split the 48 training subjects for the optimization
                       scoring='roc_auc',
                       n_jobs=-1)
    # the outer-loop training data for the first fold (i.e. the 48 subjects)
    clf.fit(data[cv_final[ii][0]], y[cv_final[ii][0]])
    # predict the 20 subjects left out for testing in the outer loop
    predictions.append(clf.predict(data[cv_final[ii][1]]))
    ii = ii + 1
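Would something along the following lines be the right way to do it? This is
only a rough sketch, under the assumption that cv_final is a list of
(outer_train, outer_test) index pairs, that cv_final_nested[ii] is a list of
(inner_train, inner_val) index pairs for that fold, and that the inner indices
refer to the full 68-subject array, so they have to be remapped to positions
within the 48 training subjects actually passed to fit (GridSearchCV also
expects cv to be an iterable of pairs, so a single pair would have to be
wrapped in a list):

import numpy as np
from sklearn.model_selection import GridSearchCV

best_params_per_fold = []
predictions = []

for ii, (outer_train_idx, outer_test_idx) in enumerate(cv_final):
    # map absolute subject indices to positions within the outer training subset
    pos = {abs_idx: p for p, abs_idx in enumerate(outer_train_idx)}
    # if cv_final_nested[ii] is a single (train, val) pair, wrap it in a list first
    inner_cv = [(np.array([pos[i] for i in tr]), np.array([pos[i] for i in va]))
                for tr, va in cv_final_nested[ii]]

    clf = GridSearchCV(pipeline,
                       param_grid=param_grid,
                       cv=inner_cv,  # positions relative to the 48 training subjects
                       scoring='roc_auc',
                       verbose=1,
                       n_jobs=-1)
    clf.fit(data[outer_train_idx], y[outer_train_idx])

    best_params_per_fold.append(clf.best_params_)  # winning parameters of this fold
    predictions.append(clf.predict(data[outer_test_idx]))  # the 20 left-out subjects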
This, however, works and should be (more or less) what we would like to achieve
with the loop above. However, extracting the best parameters for each fold in
order to predict the left-out data seems impossible, or at least very laborious.
clf = GridSearchCV(pipeline,
                   param_grid=param_grid,
                   verbose=1,
                   cv=cv_final_nested,
                   scoring='roc_auc',
                   n_jobs=-1)
clf.fit(data, y)
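Is reading the per-split scores out of cv_results_ the intended way to recover
the per-fold winners in this case? A sketch, assuming cv_final_nested is the
list of 9 (train, val) index pairs passed as cv above; as far as we understand,
GridSearchCV still refits a single model on all of data, which is not quite
what we need:

import numpy as np

results = clf.cv_results_
for i in range(len(cv_final_nested)):
    split_scores = results['split%d_test_score' % i]  # one score per parameter candidate
    best_idx = np.argmax(split_scores)
    print('fold %d: best params %s (roc_auc = %.3f)'
          % (i, results['params'][best_idx], split_scores[best_idx]))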
Any hint on how to solve this problem would be really appreciated.
Best
Ludovico
________________________________
From: scikit-learn <[email protected]> on behalf
of [email protected] <[email protected]>
Sent: Monday, 27 February 2017 17:27
To: [email protected]
Subject: scikit-learn Digest, Vol 11, Issue 29
Send scikit-learn mailing list submissions to
[email protected]
To subscribe or unsubscribe via the World Wide Web, visit
https://mail.python.org/mailman/listinfo/scikit-learn
or, via email, send a message with subject or body 'help' to
[email protected]
You can reach the person managing the list at
[email protected]
When replying, please edit your Subject line so it is more specific
than "Re: Contents of scikit-learn digest..."
Today's Topics:
1. GSoC 2017 (Gael Varoquaux)
2. Control over the inner loop in GridSearchCV (Ludovico Coletta)
3. Re: Control over the inner loop in GridSearchCV
(Sebastian Raschka)
----------------------------------------------------------------------
Message: 1
Date: Mon, 27 Feb 2017 11:58:35 +0100
From: Gael Varoquaux <[email protected]>
To: Scikit-learn user and developer mailing list
<[email protected]>
Subject: [scikit-learn] GSoC 2017
Message-ID: <[email protected]>
Content-Type: text/plain; charset=iso-8859-1
Hi,
Students have been inquiring about the GSoC (Google Summer of Code) with
scikit-learn, and the core team has been quite silent about it.
I am happy to announce that scikit-learn will be taking part in the GSoC
again. The reason we decided to do this is to give a chance to young,
talented, and motivated students.
Importantly, our most limiting resource is the time of our experienced
developers. This is clearly visible from the number of pending pull
requests. Hence, we need students to be very able and independent. This does
not, of course, mean that they will be working without supervision from
mentors. Such supervision is crucial for moving forward with a good project
that delivers mergeable code. However, we will need the students to be very
good at interacting efficiently with the mentors. Also, I should stress
that we will be able to take only a very small number of students.
With that said, let me introduce the 2017 GSoC for scikit-learn. We have
set up a wiki page which summarizes the experiences from last year and
the ideas for this year:
https://github.com/scikit-learn/scikit-learn/wiki/Google-summer-of-code-(GSOC)-2017
Interested students should declare their interest on the mailing list,
and discuss with possible mentors here. Factors of success will be
* careful work on a good proposal that takes one of the ideas on the wiki,
breaks it down into a realistic plan with multiple steps, and shows a
good understanding of the problem.
* demonstration of the required skill set via successful pull requests in
scikit-learn.
Cheers,
Gaël
--
Gael Varoquaux
Researcher, INRIA Parietal
NeuroSpin/CEA Saclay , Bat 145, 91191 Gif-sur-Yvette France
Phone: ++ 33-1-69-08-79-68
http://gael-varoquaux.info http://twitter.com/GaelVaroquaux
------------------------------
Message: 2
Date: Mon, 27 Feb 2017 14:27:59 +0000
From: Ludovico Coletta <[email protected]>
To: "[email protected]" <[email protected]>
Subject: [scikit-learn] Control over the inner loop in GridSearchCV
Message-ID:
<blupr0301mb2017606e3e103266bbab5e698c...@blupr0301mb2017.namprd03.prod.outlook.com>
Content-Type: text/plain; charset="iso-8859-1"
Dear Scikit experts,
we are stuck with GridSearchCV. Nobody else was able or willing to help us; we
hope you will.
We are analysing neuroimaging data coming from 3 different MRI scanners, where
for each scanner we have a healthy group and a disease group. We would like to
merge the data from the 3 different scanners in order to classify the healthy
subjects from the ones who have the disease.
The problem is that we can almost perfectly classify the subjects according to
the scanner (e.g. the healthy subjects from scanner 1 and scanner 2). We are
using a custom cross-validation schema to account for the different scanners:
when no hyperparameter (SVM) optimization is performed, everything is
straightforward. Problems arise when we would like to perform hyperparameter
optimization: in this case we need to balance for the different scanners in the
optimization phase as well. We also found a custom cv schema for this, but we
are not able to pass it to the GridSearchCV object. We would like to get
something like the following:
pipeline = Pipeline([('scl', StandardScaler()),
                     ('sel', RFE(estimator, step=0.2)),
                     ('clf', SVC(probability=True, random_state=42))])

param_grid = [{'sel__n_features_to_select': [22, 15, 10, 2],
               'clf__C': np.logspace(-3, 5, 100),
               'clf__kernel': ['linear']}]

clf = GridSearchCV(pipeline,
                   param_grid=param_grid,
                   verbose=1,
                   scoring='roc_auc',
                   n_jobs=-1)
# cv_final is the custom cv for the outer loop (9 folds)
ii = 0
while ii < len(cv_final):
    # fit and predict
    clf.fit(data[?], y[?])
    predictions.append(clf.predict(data[cv_final[ii][1]]))  # outer test data
    ii = ii + 1
We tried almost everything. When we define clf in the loop, pass the i-th
cv_nested as the cv argument, and fit it on the training data of the i-th
custom_cv fold, we get a "too many values to unpack" error. On the other hand,
when we try to pass the nested i-th cv fold as the cv argument for clf and call
fit on the same cv_nested fold, we get an "index out of bounds" error.
Two questions:
1) Is there any workaround to avoid the split when clf is called without a cv
argument?
2) We suppose that for hyperparameter optimization the test data is removed
from the dataset and a new dataset is created. Is this true? In this case we
would only have to adjust the indices accordingly.
Thank you for your time and sorry for the long text
Ludovico
------------------------------
Message: 3
Date: Mon, 27 Feb 2017 11:27:24 -0500
From: Sebastian Raschka <[email protected]>
To: Scikit-learn user and developer mailing list
<[email protected]>
Subject: Re: [scikit-learn] Control over the inner loop in
GridSearchCV
Message-ID: <[email protected]>
Content-Type: text/plain; charset=utf-8
Hi, Ludovico,
what format (shape) is data in? Are these the arrays from a KFold iterator? In
this case, the 'question marks' in your code snippet should simply be the train
and validation subset indices generated by the KFold generator. E.g.,

skfold = StratifiedKFold(y=y_train, n_folds=5, shuffle=True, random_state=1)

for outer_train_idx, outer_valid_idx in skfold:
    ...
    gridsearch_object.fit(X_train[outer_train_idx], y_train[outer_train_idx])
>
> On the other hand, when we try to pass the nested i-th cv fold as the cv
> argument for clf and call fit on the same cv_nested fold, we get an "index
> out of bounds" error.
> Two questions:
Are you using a version older than scikit-learn 0.18? Technically,
GridSearchCV, RandomizedSearchCV, cross_val_score etc. should all support
iterables of train_ and test_ indices, e.g.:
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

for name, gs_est in sorted(gridcvs.items()):
    nested_score = cross_val_score(gs_est,
                                   X=X_train,
                                   y=y_train,
                                   cv=outer_cv,
                                   n_jobs=1)
Best,
Sebastian
------------------------------
Subject: Digest Footer
_______________________________________________
scikit-learn mailing list
[email protected]
https://mail.python.org/mailman/listinfo/scikit-learn
------------------------------
End of scikit-learn Digest, Vol 11, Issue 29
********************************************
_______________________________________________
scikit-learn mailing list
[email protected]
https://mail.python.org/mailman/listinfo/scikit-learn