Hi everybody,

In the end I think I (partially) solved my own problem by doing the following
(maybe it can help somebody else):

# Import the classifier
from sklearn.svm import SVC
# Pipeline
from sklearn.pipeline import Pipeline
# Import a feature reduction technique
from sklearn.feature_selection import SelectKBest, f_classif
# Grid search and cross-validation utilities (pre-0.18 module names, as used below)
from sklearn.grid_search import GridSearchCV
from sklearn import cross_validation

# Define the pipeline
pipeline = Pipeline([('sel', SelectKBest()),
                     ('clf', SVC(kernel='linear'))])
param_grid = [{'sel__k': [410628, 200000, 100000, 50000, 20000, 10000, 5000, 2500],
               'clf__C': [0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000, 100000],
               'clf__kernel': ['linear']}]

# This works. However, it would be better to have your own cv in the pipeline
clf = GridSearchCV(pipeline,
                   param_grid=param_grid,
                   verbose=1,
                   # cv = ...
                   scoring='accuracy',
                   n_jobs=1)

cv = cross_validation.ShuffleSplit(len(X), n_iter=10, test_size=0.2,
                                   random_state=0)
# I need to call fit because I then have to display the weights on a brain image
clf.fit(X, y)
scores = cross_validation.cross_val_score(clf, X, y, cv=cv)
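
For readers on a newer scikit-learn, here is a minimal sketch of the same nested scheme using `sklearn.model_selection`, which replaced the `cross_validation` and `grid_search` modules used above. The synthetic data, the small grid, the fold counts, and all variable names are assumptions for illustration only. `cross_validate` with `return_estimator=True` keeps the fitted search object of every outer fold, so the selected parameters and the linear SVM weights can be inspected per fold, and the stored splits give the outer train/test indices.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import (GridSearchCV, ShuffleSplit,
                                     StratifiedKFold, cross_validate)
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic stand-in for the real data (assumption): 24 subjects, 300 features.
rng = np.random.RandomState(0)
X = rng.randn(24, 300)
y = np.array([0] * 12 + [1] * 12)

pipeline = Pipeline([("sel", SelectKBest(f_classif)),
                     ("clf", SVC(kernel="linear"))])
param_grid = {"sel__k": [20, 100], "clf__C": [0.01, 1, 100]}

# Inner CV: runs only inside each outer training set to pick k and C.
inner_cv = StratifiedKFold(n_splits=3)
search = GridSearchCV(pipeline, param_grid, cv=inner_cv, scoring="accuracy")

# Outer CV: materialise the splits so the indices can be inspected later.
outer_cv = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
outer_splits = list(outer_cv.split(X, y))

results = cross_validate(search, X, y, cv=outer_splits,
                         scoring="accuracy", return_estimator=True)

fold_weights = []
for fold, (est, (train_idx, test_idx)) in enumerate(
        zip(results["estimator"], outer_splits)):
    best = est.best_estimator_
    support = best.named_steps["sel"].get_support()
    # Map the SVM weights back to the full feature space
    # (features dropped by SelectKBest keep a weight of zero).
    weights = np.zeros(X.shape[1])
    weights[support] = best.named_steps["clf"].coef_.ravel()
    fold_weights.append(weights)
    print(fold, est.best_params_, "test indices:", test_idx.tolist())

print("outer-fold accuracies:", results["test_score"])
```

Each entry of `fold_weights` has one value per original feature, so it can be reshaped or masked back into image space in the same way as the weights of a single fitted model.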

However, I would like to ask you a couple of things:
1) My biggest concern is to avoid double dipping. Do you think that what I did
above is right? Is it possible to somehow retrieve the indices of the nested
samples for each of the 10 outer folds?
2) How big should the largest C in the grid search be?
3) Is there a way to retrieve the weights of the "outer" fold(s)? I tried with
the documentation, but I was unsuccessful.

Best,
Ludovico
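
On the range of C in question 2, one common heuristic (an assumption here, not something this thread settles) is to search a logarithmic grid and widen it only when the selected value lands on an edge of the grid. The sketch below builds such a grid and checks for that case; `c_grid` and `on_grid_edge` are hypothetical helper names, not part of scikit-learn.

```python
import numpy as np


def c_grid(low_exp=-3, high_exp=5):
    """Return powers of ten from 10**low_exp to 10**high_exp (inclusive)."""
    return np.logspace(low_exp, high_exp, num=high_exp - low_exp + 1)


def on_grid_edge(best_c, grid):
    """True if the selected C is the smallest or largest value in the grid."""
    return bool(np.isclose(best_c, grid[0]) or np.isclose(best_c, grid[-1]))


grid = c_grid()  # same span as the grid above: 0.001 up to 100000
print(grid)
print(on_grid_edge(100000.0, grid))
print(on_grid_edge(1.0, grid))
```

After a search, `on_grid_edge(search.best_params_['clf__C'], grid)` would flag a grid that may be too narrow. With far more features than subjects the training data are usually linearly separable, and in that situation values of C above some threshold all lead to the same hard-margin solution, so extending the grid much further upward may change nothing.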

From: ludo25...@hotmail.com
To: scikit-learn-general@lists.sourceforge.net
Date: Wed, 20 Apr 2016 14:46:47 +0200
Subject: [Scikit-learn-general] Univariate feature selection with 
hyperparameter estimation on a neuroimaging dataset with scikit

Hi guys,

I am new to Python and the scikit-learn package, so I hope someone can help
me. For my master's thesis I am analyzing a neuroimaging dataset. I have 24
subjects divided into two classes (12 subjects each) that I would like to
classify.

My idea is to use "SelectKBest" to select the best features, run a GridSearch
for the C parameter, filter the held-out test data with the results of
SelectKBest, select the best C from the GridSearch, and use it to classify the
held-out samples. To do this I have to implement two cross-validations on the
same dataset: one "outer" cv for defining the test sample, and a nested cv for
finding the best features and the best C.

For the cross-validation I would like to use the stratified one. Therefore, if
I got things right, I have to do the following (example of the first fold of
the two cross-validations): subject 1 (subject 1 of class 1) and subject 13
(subject 1 of class 2) as the test sample of the outer cross-validation,
subjects 2 and 14 as the test sample of the nested cross-validation (for
testing the best C), and subjects 3:12 and 15:24 for selecting the best 20000
features and the best C. I think that I have done everything right up to the
point where I have to filter the held-out data with the selected features.
Here I am making a mistake, and I reach 100% accuracy. I also tried changing
the modality (other features), but I keep getting 100% accuracy.
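
The fold layout described above can be reproduced with two nested `StratifiedKFold` splitters. This is only a sketch: it uses the newer `sklearn.model_selection` API (an assumption, since the post relies on the older `cross_validation` module) and zero-based array indices, so subjects 1 and 13 correspond to indices 0 and 12. The variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Labels as described in the post: indices 0-11 are class 1, 12-23 are class 2.
y = np.array([0] * 12 + [1] * 12)
X_dummy = np.zeros((len(y), 1))  # the splitter only needs the labels

# Outer CV: 12 folds, each holding out one subject per class.
outer_cv = StratifiedKFold(n_splits=12)
outer_train, outer_test = next(iter(outer_cv.split(X_dummy, y)))

# Inner CV: split only the outer training subjects, again one per class.
inner_cv = StratifiedKFold(n_splits=11)
inner_train_pos, inner_test_pos = next(
    iter(inner_cv.split(X_dummy[outer_train], y[outer_train])))

# The inner splitter returns positions within outer_train, so map them
# back to indices into the full data array.
inner_train = outer_train[inner_train_pos]
inner_test = outer_train[inner_test_pos]

print("outer test :", outer_test.tolist())
print("inner test :", inner_test.tolist())
print("inner train:", inner_train.tolist())
```

The point of the mapping step is that the inner splitter never sees the outer test subjects at all, which is what keeps the held-out data out of both feature selection and the choice of C.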
Here is the code for the first fold. Any help would be greatly appreciated.

Thank you,
Ludo
The numbers refer to the indices of the array in which I stored the data; y
are the labels.
0:11 --> subjects of class 1
12:23 --> subjects of class 2
I did the same for every fold.

import numpy as np
from sklearn.cross_validation import StratifiedKFold
from sklearn.grid_search import GridSearchCV

# outer cv
cv_outer = StratifiedKFold(y, 12)
train_nested1 = [[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]]
test_nested1 = [[1, 13]]
# list() so the single split can be iterated more than once
cv_nested1 = list(zip(train_nested1, test_nested1))

# Classifiers, feature selection, hyperparameter optimization and pipeline
# import the classifier
from sklearn.svm import SVC
# Pipeline
from sklearn.pipeline import Pipeline
# import and define a feature reduction technique
from sklearn.feature_selection import SelectKBest, f_classif
pipeline = Pipeline([('sel', SelectKBest()),
                     ('clf', SVC(kernel='linear'))])
param_grid = [{'sel__k': [80000, 40000, 20000, 10000, 5000, 2500],
               'clf__C': [0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000],
               'clf__kernel': ['linear']}]

# FOLD 1
grid_search1 = GridSearchCV(pipeline, param_grid=param_grid, verbose=1,
                            cv=cv_nested1, scoring='accuracy', n_jobs=1)
grid_search1.fit(X, y)
print(grid_search1.best_estimator_)
print(grid_search1.best_score_)

# Now we test the held out data. Example of the first fold
# FOLD 1
cv_scores1 = []
# NOTE: clf_final1 is not defined in the posted code; presumably it is
# grid_search1.best_estimator_
a_1 = clf_final1.named_steps['sel']  # Extract the selector object
b_1 = a_1.transform(X[list(cv_outer)[0][1]])  # transform the corresponding held out data
c_1 = clf_final1.named_steps['clf']  # Extract the classifier object (best C parameter)
labels_pred1 = c_1.predict(b_1)  # predict
cv_scores1.append(np.sum(labels_pred1 == y[list(cv_outer)[0][1]]))
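
One possible explanation for the perfect score, offered as an assumption rather than a diagnosis: `grid_search1.fit(X, y)` receives all 24 subjects, and with the default `refit=True` the best pipeline is refitted on everything passed to `fit`, which includes the subjects later used as held-out data. The sketch below illustrates the effect on pure noise, where no real signal exists. It uses the newer `sklearn.model_selection` API, and the data, grid, and names (`make_search`, `leaky`, `clean`) are made up for illustration.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Pure noise (assumption): the labels carry no information about X.
rng = np.random.RandomState(0)
X = rng.randn(24, 2000)
y = np.array([0] * 12 + [1] * 12)


def make_search():
    """Build a fresh grid search over the feature count and C."""
    pipeline = Pipeline([("sel", SelectKBest(f_classif)),
                         ("clf", SVC(kernel="linear"))])
    param_grid = {"sel__k": [50, 200], "clf__C": [0.1, 10]}
    return GridSearchCV(pipeline, param_grid,
                        cv=StratifiedKFold(n_splits=5), scoring="accuracy")


outer_cv = StratifiedKFold(n_splits=12)

# Leaky variant: the search (and its final refit) sees every subject.
leaky = make_search().fit(X, y)

leaky_hits = 0
clean_hits = 0
for train_idx, test_idx in outer_cv.split(X, y):
    # Leaky: the "held-out" subjects were part of the data used for fitting.
    leaky_hits += int(np.sum(leaky.predict(X[test_idx]) == y[test_idx]))
    # Clean: selection, tuning and refit use the outer training subjects only.
    clean = make_search().fit(X[train_idx], y[train_idx])
    clean_hits += int(np.sum(clean.predict(X[test_idx]) == y[test_idx]))

leaky_acc = leaky_hits / len(y)
clean_acc = clean_hits / len(y)
print("leaky accuracy:", leaky_acc)
print("clean accuracy:", clean_acc)
```

On noise like this, a leak-free estimate should hover around chance, whereas the leaky one tends to look near perfect. If the real data behave the same way, fitting the search on `X[train_idx]` only, as in the clean variant, would be the change to try.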


                                          

------------------------------------------------------------------------------
Find and fix application performance issues faster with Applications Manager
Applications Manager provides deep performance insights into multiple tiers of
your business applications. It resolves application problems quickly and
reduces your MTTR. Get your free trial!
https://ad.doubleclick.net/ddm/clk/302982198;130105516;z
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
                                          
