Just a quick follow-up to some of the previous problems I have had: after some
kind assistance at the PyData London meetup last week, I found out why I was
getting different results from an SVC in R and in sklearn. It turns out that R
scales the inputs automatically whereas sklearn does not, so that problem had
an easy fix.
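In case it is useful to anyone else, the fix on the sklearn side is simply to
scale the inputs before they reach the SVC, for example with a StandardScaler
in a pipeline. A rough sketch (assuming standardisation is the scaling wanted):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# scale the inputs first, since sklearn does not do this automatically
scaled_svc = Pipeline([('scaling', StandardScaler()),
                       ('classification', SVC(kernel='poly'))])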
The other problem I was having was getting different, non-reproducible results
from an SVC on a large data set. I think this may have been down to how I was
implementing the parallelisation of the code myself, so I have decided to
stick with the built-in parallelisation methods of sklearn instead. Because of
this, however, I am having some difficulty putting together a pipeline with
cross-validation and feature selection.
I have a dataset with a very large number of features (> 22,000), and it is
taking a very long time to build and cross-validate any models. What I would
like to do is carry out the CV process, but with feature selection performed
within every fold, before a model is built. I have read that it is important
not to carry out feature selection on the whole dataset first, as that uses
information from samples that will end up in the test set when deciding which
predictors are important. I would also like to include a parameter search with
the RandomizedSearchCV function in this process.
The end result I would like is the best model selected by the parameter
search, together with a list of the predictors that were used to build it. The
code I have attempted so far is as follows:
from __future__ import print_function
import numpy as np
from sklearn.cross_validation import train_test_split, StratifiedShuffleSplit
from sklearn.svm import SVC
from sklearn.grid_search import RandomizedSearchCV
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score
from scipy.stats import expon, poisson
# a 7731 by 22567 matrix - too many dimensions
inputs = np.load('combo_pgc2_parameters_p_05_allele_counts_phenotype_array.npy')
targets = np.load('COMBO_phenotypes.npy')
X_train, X_test, y_train, y_test = train_test_split(inputs, targets)
cv_model = StratifiedShuffleSplit(y_train, n_iter=4)
svm_poly = SVC(kernel='poly')
selector = SelectKBest(chi2, k=5000)
clf = Pipeline([('feature_selection', selector),
                ('classification', svm_poly)])
param_dists = {'classification__C': expon(scale=100),
               'classification__gamma': expon(scale=0.1),
               'classification__degree': poisson(mu=1, loc=1)}
if __name__ == '__main__':
    grid = RandomizedSearchCV(clf, param_distributions=param_dists,
                              cv=cv_model, n_iter=4, n_jobs=-1)
    grid.fit(X_train, y_train)
    y_pred = grid.predict(X_test)
    score = roc_auc_score(y_test, y_pred)
    print(score)
While this is the best I can do at the moment, I'm not sure it is doing
exactly what I want. Is it actually splitting the data into folds and then
carrying out the feature selection within each fold, or is it carrying out the
feature selection on everything first and then proceeding with the
cross-validation? And finally, if it is indeed doing what I want, how do I
find out which of the predictors were chosen for the final model with the best
parameters?
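In case it helps to show what I am after, this is roughly what I imagine the
last step would look like, pulling the fitted selector back out of the best
estimator (I am assuming best_estimator_, named_steps and get_support are the
right things to reach for here, but I am not sure):

# my guess at recovering the chosen predictors from the refitted best model
best_pipeline = grid.best_estimator_
selector_step = best_pipeline.named_steps['feature_selection']
chosen_columns = selector_step.get_support(indices=True)  # indices kept by SelectKBest
print(grid.best_params_)
print(chosen_columns)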
Many thanks,
Tim V-G