Ok, well once again, thank you for your reply. I will provide some of my code
here and I hope that it helps. Just bear in mind that I have not had any formal
training in programming, so my code will very much come under the category of
'PhDware'.
So, this was the class that I wrote that gets imported into my different
analysis scripts:
# Filename: Clozuk_Machine_Learning_saving_random_state.py
"""
A module that is to be imported into other Python scripts in order to run
the different algorithms on the data.
"""
import numpy as np
import numpy.random as npr
from sklearn.svm import SVC
from sklearn.cross_validation import train_test_split
from sklearn.metrics import roc_auc_score
class DimensionError(Exception):
    pass
class clSVC(SVC):
    """
    clSVC(inputs, results, save_path, C=1, kernel='rbf', tp=0.5)

    Sets up a Support Vector Classifier using the SVC object from the
    sklearn package.

    For a detailed list of parameters and attributes, see the scikit-learn
    documentation on SVC.

    Additional Parameters
    ---------------------
    inputs : numpy array
        A 2D numpy array of all the input data to the model, with features
        represented in columns and samples represented in rows.
    results : numpy array
        A one-dimensional numpy array of the target values of the data.
    save_path : string
        The full path to where the answers for each permutation will be
        saved.
    C : float
        A value representing how much importance is given to fitting the
        training data versus regularisation. Higher C values mean the model
        is fitted more closely to the data given, but could result in
        over-fitting.
    tp : float, 0 < tp < 1
        A value representing the proportion of the data used in the
        training set, with the remainder going to the test set.
    """
    def __init__(self, inputs, results, save_path, C=1, kernel='rbf', tp=0.5):
        super(clSVC, self).__init__(C=C, kernel=kernel)
        if len(inputs.shape) == 2:
            self.inputs = inputs
        else:
            raise DimensionError('Inputs array must be 2D')
        if len(results.shape) == 1:
            self.results = results
        else:
            raise DimensionError('Results array must be 1D')
        self.save_path = save_path
        if 0 < tp < 1:
            self.tp = tp
        else:
            raise ValueError('tp must be between 0 and 1. %0.2f entered' % tp)
    def run_model(self, seed=None):
        if seed is None:
            seed = npr.randint(10000)
        # Seeding numpy's global random state; note that this state is
        # shared process-wide, so concurrent threads all draw from it
        npr.seed(seed)
        # Using this seed to set the estimator's random_state as well
        self.random_state = seed
        X_train, X_test, y_train, y_test = train_test_split(
            self.inputs, self.results, train_size=self.tp)
        self.fit(X_train, y_train)
        predictions = self.predict(X_test)
        percentage_correct = (np.sum(predictions == y_test) /
                              float(len(predictions))) * 100
        score = roc_auc_score(y_test, predictions)
        # Saving every iteration just in case the walltime runs out
        with open('%s.txt' % self.save_path, 'a') as fi:
            fi.write('%d %f %f\n' % (seed, percentage_correct, score))
        return (seed, percentage_correct, score)
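For reference, here is a sketch (not the code I actually ran) of a run_model-style function in which the seed is handed straight to train_test_split through its random_state argument, so nothing depends on numpy's global state:

import numpy as np
from sklearn.cross_validation import train_test_split
from sklearn.metrics import roc_auc_score

def run_model_explicit(estimator, inputs, results, tp, seed):
    # Hypothetical variant: the seed goes directly into train_test_split,
    # so the split does not depend on numpy's process-wide random state
    estimator.random_state = seed
    X_train, X_test, y_train, y_test = train_test_split(
        inputs, results, train_size=tp, random_state=seed)
    estimator.fit(X_train, y_train)
    predictions = estimator.predict(X_test)
    percentage_correct = 100.0 * np.mean(predictions == y_test)
    score = roc_auc_score(y_test, predictions)
    return (seed, percentage_correct, score)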
And here is an example of a script that imports and runs this:
# Filename: svm_125_GWAS_mp.py
"""
Reads in the genotypes and the phenotypes
and then carries out the whole SVC on them
"""
import numpy as np
from Clozuk_Machine_Learning_saving_random_state import clSVC
from multiprocessing.dummy import Pool as ThreadPool
from sys import argv
# Getting the C parameter and the kernel from the command-line arguments;
# C arrives as a string, so it is converted to an integer
script, C, kernel = argv
C = int(C)
# Reading in the inputs and the targets
inputs = np.load('stored_data_2015/125_GWAS/125_GWAS_combo_LOR_weighted_nan_removed_probabilistic_imputation.npy')
targets = np.load('stored_data_2015/125_GWAS/125_GWAS_combo_phenotype.npy')
# Now building the actual model
svc = clSVC(inputs, targets, C=C,
            save_path='svm_answers_2015/125_GWAS/500_permutations_125_GWAS_%s_svm_probabilistic_imputation_C_%d' % (kernel, C),
            kernel=kernel, tp=0.75)
# Now reading in the random seeds
seeds = np.load('stored_data_2015/random_seeds.npy')
# Now setting up the thread pool (multiprocessing.dummy gives threads
# rather than separate processes)
pool = ThreadPool(16)
# And mapping the learning function across the pool
answers = pool.map(svc.run_model, seeds)
# Closing the pool and waiting for all the workers to finish
pool.close()
pool.join()
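For debugging, the serial equivalent of the pool section (the same seeds, one at a time, with no threads) would just be:

# Serial run over the same seeds, for comparison with the threaded version
answers = [svc.run_model(seed) for seed in seeds]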
Just to recap the problem for anyone new who might be reading this: the aim
is to predict the case/control status of a schizophrenia dataset given
information on the genotypes of the different subjects. I am running this
procedure 500 times for each dataset. Each time, the data is split into
training and test sets (75% and 25% respectively) in a different way using
the train_test_split function. Each split is determined by a random seed
integer read in from the file 'random_seeds.npy'. The intention was to train
and test an SVC model on each split and then look at the distribution of the
scores, both to see what the general performance was and to gauge the effect
of splitting the data into different training/test sets. For each
permutation, the seed is used to determine both the train_test_split and the
random_state.
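As a sanity check on that assumption, a serial toy example like the following (with made-up data standing in for the real arrays) does hand back the identical split when the global seed is reset:

import numpy as np
import numpy.random as npr
from sklearn.cross_validation import train_test_split

# Toy stand-ins for the real genotype/phenotype arrays
X = npr.RandomState(0).rand(40, 5)
y = npr.RandomState(1).randint(0, 2, 40)

npr.seed(1234)
first = train_test_split(X, y, train_size=0.75)
npr.seed(1234)
second = train_test_split(X, y, train_size=0.75)

# Run serially, re-seeding reproduces the identical split
assert all(np.array_equal(a, b) for a, b in zip(first, second))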
I thought that using the same seeds would result in the same performance for
the individual splits when the procedure was carried out multiple times.
While this is happening for a smaller dataset with only 125 features, on a
larger one with over 31,000 features it is not. For the majority of the
splits, I am getting ROC scores of around 0.6, which is what I would expect;
but in about 15% of the splits, the scores are grouped very highly, around
0.9. When I first saw this, I wanted to find out what it was about those
particular train/test splits that was causing the high scores, so I isolated
the seeds that had resulted in good performance and ran the procedure on
those alone. Instead of seeing a repetition of the high performance, I got a
pattern similar to the one from before: most scores were around 0.6, and
about 15% were around 0.9. So the same splits of the data, with the
random_state set identically for both runs, are giving different results.
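A filter over the saved text file is enough for that isolation step, something along these lines (placeholder filenames; the columns are seed, percentage correct and ROC AUC, as written by run_model):

import numpy as np

# 'results.txt' stands in for the real save_path file
res = np.loadtxt('results.txt')
# Keep the seeds whose ROC AUC (third column) falls in the high cluster
high_seeds = res[res[:, 2] > 0.8, 0].astype(int)
np.save('high_scoring_seeds.npy', high_seeds)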
I decided to look at some of the seeds that resulted in high scores both
times by running the procedure on them multiple times. This time I got what
I was expecting: the same answer (around 0.6) repeated for each run of each
seed.
So what is puzzling me is why any of these high scores appeared at all in
the first place.
Tim
On 7 Jan 2015, at 16:18, Andy <t3k...@gmail.com> wrote:
Hi Tim.
Please keep all discussions on the mailing list, as individual contributors
might not find the time to respond.
With fixed random seeds, results should be reproducible. If you provide the
full code, it might be possible to say where the problem lies.
Sorry, I meant ShuffleSplitCV, but you can also use any other CV object.
The cross_val_predict function is not super essential, just a convenience. You
should start with computing the scores using cross_val_score.
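For example, something like this (a minimal sketch on toy data, using the 0.15-era cross_validation API):

import numpy as np
from sklearn.svm import SVC
from sklearn.cross_validation import ShuffleSplit, cross_val_score

# Toy data in place of the real arrays
X = np.random.RandomState(0).rand(100, 10)
y = np.random.RandomState(1).randint(0, 2, 100)

# One CV object owns all the splits, pinned down by a single random_state
cv = ShuffleSplit(len(y), n_iter=500, test_size=0.25, random_state=42)
scores = cross_val_score(SVC(C=1, kernel='rbf'), X, y,
                         scoring='roc_auc', cv=cv)
print(scores.mean())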
For clusters maintained by someone else: installing a version locally is super easy.
Just check out the dev version and set the pythonpath appropriately,
or install into a virtual environment. Many people have custom python libraries
(or whole environments) installed.
Cheers,
Andy
On 01/07/2015 05:42 AM, Timothy Vivian-Griffiths wrote:
Dear Andy,
Thank you for your reply on the scikit-learn problem I was having. Seeing as I
am new to this, I am writing to you directly; is this what I should do, or
should I reply to the response email that you gave me?
As for the reproducibility, I have not set probability=True, so it should be
running with the default. I am also setting the random_state parameter, so I
am puzzled as to what is happening. I haven't found a single split that
reproduces the high-performing results. I understand that there will be
discrepancies in the data, but I don't understand why splits should perform
differently on different occasions.
Just another thing, I have noticed that the cross_val_predict function that you
mention is in the latest version of sklearn, but I cannot find the
RandomizedSplitCV one. Also, seeing as I am running my code on a cluster which
I do not maintain, I think it's probably best if I wait until 0.16 becomes the
stable version before I ask the admins to update. Do you have any idea of when
this might be?
Thanks for your help and apologies if I am not supposed to contact your email
address directly,
Tim