I am getting some confusing results when carrying out a permutation exercise 
on my data using the SVC class from the sklearn.svm module. The data I am 
using is quite large and very high-dimensional, but I will try to explain it 
briefly.

The dataset represents risk scores for genetic polymorphisms in a case/control 
schizophrenia study. For each sample there are 31166 predictors and a binary 
outcome variable. All the predictors have a similar level of variance, so I did 
not deem it necessary to carry out any feature scaling.

There are a total of 7763 samples in the training data set which I am using to 
build the model. I was not expecting to get a very accurate model, as there 
have been many attempts to build predictive models of schizophrenia status 
without much success. This could be due to the high degree of heterogeneity in 
the disorder, meaning that it could actually be a collection of different 
disorders, each with a different underlying genetic etiology.

For this reason, I decided to carry out a permutation procedure with 500 
iterations: in each iteration I selected 75% of the samples to train the 
model, keeping the remaining 25% as the cross-validation set. I then scored 
this held-out set with roc_auc_score from sklearn.metrics and stored all of 
the results. I performed the split using the train_test_split function from 
sklearn.cross_validation, and in order to allow reproducibility of the 
procedure I set the seeds manually for the splits at each permutation. The 500 
random seeds were created in advance and stored as a .npy array file which 
could be loaded for every run of the procedure. These seeds were also used to 
set the 'random_state' of each model when it was built. The intention was to 
get a distribution of scores to see how the model performed in general, and to 
see the variance of the distribution that resulted from using different 
train/test splits.
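In outline, each permutation looks roughly like this (a simplified sketch of my 
code; the file names are placeholders, and X and y stand for my predictor 
matrix and outcome vector):

    import numpy as np
    from sklearn.cross_validation import train_test_split
    from sklearn.svm import SVC
    from sklearn.metrics import roc_auc_score

    # placeholder file names; X is the (7763, 31166) matrix of risk scores,
    # y is the binary case/control outcome
    X = np.load('risk_scores.npy')
    y = np.load('case_control_labels.npy')
    seeds = np.load('permutation_seeds.npy')   # the 500 pre-generated seeds

    def score_one_seed(seed):
        seed = int(seed)
        # 75/25 split, seeded so that each permutation can be reproduced later
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.25, random_state=seed)
        clf = SVC(kernel='linear', C=1, random_state=seed)
        clf.fit(X_train, y_train)
        # score the held-out 25% with ROC AUC on the decision values
        return roc_auc_score(y_test, clf.decision_function(X_test).ravel())

    scores = [score_one_seed(s) for s in seeds]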

I ran this procedure using the linear kernel (and later the RBF kernel, with 
similar results), and I tried C values of 1 and 1000 in order to cover two 
extremes (though in fact the results did not change much between them). 
However, the results were quite surprising. What I found was that the majority 
of the splits performed relatively poorly, as I expected, with ROC AUC scores 
of around 0.59 - 0.6; but about 15% of the splits performed very well (~0.9). 
I saved the seeds that performed well in order to try to reproduce the 
results.
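To pick out the splits that did well, I did something along these lines (the 
0.8 cut-off and the file name are only illustrative):

    scores = np.array(scores)
    high_seeds = seeds[scores > 0.8]   # the splits that scored ~0.9
    np.save('high_performing_seeds.npy', high_seeds)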

It is my understanding that, given the same training and test data and the 
same random_state, the SVC algorithm should reproduce the same model, but this 
is not happening. When I ran the permutation procedure again on the previous 
high-performing seeds, most of these reverted to the same poor performance as 
the others, with again about 15% of this smaller set giving high performance. 
I then decided to isolate some of the seeds that had performed well both 
times. I ran the procedure multiple times using one of these seeds repeatedly, 
and I got what I expected: the exact same low score (~0.6) for every 
iteration. I did not succeed in finding a seed that reproduced a previous high 
performance.
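For example, isolating one of the previously "good" seeds and re-running it on 
its own, with the same score_one_seed function sketched above, gives an 
identical value every time:

    seed = int(np.load('high_performing_seeds.npy')[0])
    for _ in range(10):
        print(score_one_seed(seed))   # prints the same ~0.6 value each time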

So what is confusing me is: why did these high-performing scores show up at 
all in the first place? When isolating individual train/test splits, I am 
getting what I expected - a repetition of the exact same performance score. 
However, this is not happening when running the full permutation procedure. If 
anyone can offer any insight, I would be very grateful.

I have tried to provide as much detail on the data as possible without writing 
too much, but I can provide additional information if it would be helpful. I 
should also add that I was running these programs on a cluster, in parallel 
over 16 worker processes, using Pool and its map function from the 
multiprocessing module in core Python.

Many thanks,

Tim
-------------------------
Tim Vivian-Griffiths
Wellcome Trust PhD Student
Biostatistics and Bioinformatics Unit
Institute of Psychological Medicine and Clinical Neurosciences
Cardiff University
Hadyn Ellis Building
Maindy Road
Cardiff CF24 4HQ
