Just a quick follow-up to some of the problems I have written about previously: after 
getting some kind assistance at the PyData London meetup last week, I found out 
why I was getting different results using an SVC in R. It turns out that R 
scales the inputs automatically whereas sklearn does not, so that problem had 
an easy solution.
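
In case it is useful to anyone else, I think the sklearn-side fix is just to put 
a scaling step in front of the classifier, along the lines of the sketch below 
(the StandardScaler is my assumption about what R is doing to the inputs; I 
haven't checked exactly how it scales):

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Scale each feature to zero mean and unit variance before it reaches the
# SVC, which is roughly what R appears to do to the inputs automatically.
scaled_svc = Pipeline([('scaling', StandardScaler()),
                       ('classification', SVC(kernel='poly'))])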

The other problem I was having was getting different, non-reproducible results 
from an SVC on a large data set. I think this was due to how I was implementing 
the parallelisation of the code, so I have decided to stick with the built-in 
parallelisation methods of sklearn instead. However, I am now having some 
difficulty putting together a pipeline that combines cross-validation and 
feature selection.

I have a dataset with a very large number of features (> 22,000), and it is 
taking a very long time to build and cross-validate any models. What I would 
like to do is carry out the CV process, but perform the feature selection 
within every fold, before a model is built. I have read that it is important 
not to carry out feature selection on the whole dataset first, as that uses 
information from samples that end up in the test set when deciding which 
predictors are important. I would also like to include a parameter search using 
RandomizedSearchCV in this process.

The end result that I would like is the best model selected by the parameter 
search, together with a list of the predictors that were used to build it. The 
code I have attempted so far is as follows:

from __future__ import print_function
import numpy as np
from sklearn.cross_validation import train_test_split, StratifiedShuffleSplit
from sklearn.svm import SVC
from sklearn.grid_search import RandomizedSearchCV
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score
from scipy.stats import expon, poisson

# a 7731 by 22567 matrix - too many dimensions
inputs      = np.load('combo_pgc2_parameters_p_05_allele_counts_phenotype_array.npy')
targets     = np.load('COMBO_phenotypes.npy')

X_train, X_test, y_train, y_test = train_test_split(inputs, targets)

cv_model    = StratifiedShuffleSplit(y_train, n_iter=4)  # CV splits built from the training set only
svm_poly    = SVC(kernel='poly')
selector    = SelectKBest(chi2, k=5000)                   # keep the 5000 highest-scoring features
clf         = Pipeline([('feature_selection', selector),
                        ('classification', svm_poly)])
param_dists = {'classification__C': expon(scale=100),
               'classification__gamma': expon(scale=0.1),
               'classification__degree': poisson(mu=1, loc=1)}

if __name__=='__main__':
    grid        = RandomizedSearchCV(clf, param_distributions=param_dists, 
cv=cv_model, n_iter=4, n_jobs=-1)
    grid.fit(X_train, y_train)
    y_pred = grid.predict(X_test)
    score = roc_auc_score(y_test, y_pred)
    print(score)


While this is the best that I can do at the moment, I'm not sure that it is 
doing exactly what I want. Is it actually splitting the data into folds and 
then carrying out the feature selection within each one, or is it carrying out 
the feature selection first and then proceeding with the cross-validation? And 
finally, if this is indeed doing what I want, how do I find out which of the 
predictors were chosen for the final model with the best parameters?
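
For reference, the kind of thing I was imagining for that last question is 
below, although I am not sure whether best_estimator_ and named_steps are 
actually the right places to look:

# After grid.fit(...), best_estimator_ should be the pipeline refitted with
# the best parameters on the whole training set (I think).
best_pipe = grid.best_estimator_
mask      = best_pipe.named_steps['feature_selection'].get_support()
chosen    = np.where(mask)[0]   # column indices of the selected predictors
print(len(chosen))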

Many thanks,

Tim V-G