I am getting some confusing results when carrying out a permutation exercise
on my data using the SVC class from the sklearn.svm module. The data I am
using is quite large and very high-dimensional, but I will try to explain it
briefly.
The dataset represents risk scores for genetic polymorphisms in a case/control
schizophrenia study. For each sample there are 31166 predictors and a binary
outcome variable. All the predictors have a similar level of variance, so I
did not deem it necessary to carry out any feature scaling.
There are a total of 7763 samples in the training dataset which I am using to
build the model. I was not expecting a very accurate model, as there have been
many attempts to build predictive models of schizophrenia status without much
success. This could be due to the high degree of heterogeneity in the
disorder, meaning that it could actually be a collection of different
disorders, each with a different underlying genetic etiology.
For this reason, I decided to carry out a permutation procedure 500 times,
selecting 75% of the samples to train the model and keeping the remaining 25%
as the cross-validation set. I then scored this set with roc_auc_score from
sklearn.metrics and stored all of the results. I performed the split using the
train_test_split function from sklearn.cross_validation, and in order to allow
reproducibility of the procedure I set the seeds manually for the splits at
each permutation. The 500 random seeds were created in advance and stored as a
.npy array file which could be loaded for every run of the procedure. These
seeds were also used to set the 'random_state' for each model when it was
built. The intention was to get a distribution of scores to see how the model
performed in general, and the variance of the distribution that resulted from
using different train/test splits.
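
For concreteness, the core of each permutation looks essentially like the
sketch below (the file name 'permutation_seeds.npy' and the use of
decision_function to get continuous scores for roc_auc_score are my own
placeholders; X and y are the predictor matrix and outcome vector):

import numpy as np
from sklearn.svm import SVC
from sklearn.cross_validation import train_test_split
from sklearn.metrics import roc_auc_score

# X: (7763, 31166) predictor matrix; y: binary outcome vector
seeds = np.load('permutation_seeds.npy')  # the 500 seeds created in advance

scores = []
for seed in seeds:
    seed = int(seed)
    # the same seed fixes both the 75/25 split and the model's random_state
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed)
    model = SVC(kernel='linear', C=1, random_state=seed)
    model.fit(X_train, y_train)
    # .ravel() in case decision_function returns a column vector
    scores.append(roc_auc_score(y_test,
                                model.decision_function(X_test).ravel()))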
I ran this procedure using the linear kernel (and later the RBF kernel, with
similar results), trying C values of 1 and 1000 in order to cover two extremes
(although in fact the results did not change much between them). However, the
results were quite surprising. What I found was that the majority of the
splits performed relatively poorly, as I expected, at around 0.59-0.6 ROC AUC;
but about 15% of the splits performed very highly (~0.9). I saved the seeds
that performed highly in order to try and reproduce the results.
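
Saving the high scorers was along these lines (the 0.8 cutoff and the file
name are placeholders for illustration):

scores = np.array(scores)
high_performing_seeds = seeds[scores > 0.8]  # the ~15% of splits near 0.9
np.save('high_performing_seeds.npy', high_performing_seeds)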
It is my understanding that, given the same training and test data with the
same random_state, the SVC algorithm should reproduce the same model, but this
is not happening. When I re-ran the permutation procedure on the previously
high-performing seeds, most of them reverted to the same poor performance as
the others, with again about 15% of this smaller set giving high performance.
I then decided to isolate some of the seeds that had performed well both
times. I ran the procedure multiple times, using one of these seeds
repeatedly, and I got what I expected: the exact same low score (~0.6) on
every iteration. I did not succeed in finding a seed that reproduced a
previous high performance.
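
By 'isolating' a seed I mean re-running its single split on its own, roughly
like this (a sketch reusing the names from above):

seed = int(high_performing_seeds[0])
for _ in range(10):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed)
    model = SVC(kernel='linear', C=1, random_state=seed)
    model.fit(X_train, y_train)
    print(roc_auc_score(y_test, model.decision_function(X_test).ravel()))
# prints the same ~0.6 score on every iteration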
So what is confusing me is: why did these high-performing scores show up at
all in the first place? When isolating individual train/test splits, I am
getting what I expected - a repetition of the exact same performance score -
but this is not happening when running the full permutation procedure. If
anyone can give any insights I would be very grateful.
I have tried to provide as much detail on the data as possible without writing
too much, but I can provide additional information if it would be helpful. I
should also add that I was running these programs on a cluster, in parallel
over 16 worker processes, using the Pool class and its map method from the
multiprocessing module in core Python.
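
The parallel version is organised roughly as follows (run_split is my own name
for the worker function; X, y and seeds are module-level globals inherited by
the forked worker processes):

from multiprocessing import Pool

def run_split(seed):
    seed = int(seed)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed)
    model = SVC(kernel='linear', C=1, random_state=seed)
    model.fit(X_train, y_train)
    return roc_auc_score(y_test, model.decision_function(X_test).ravel())

pool = Pool(16)  # 16 worker processes on the cluster node
scores = pool.map(run_split, seeds)
pool.close()
pool.join()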
Many thanks,
Tim
-------------------------
Tim Vivian-Griffiths
Wellcome Trust PhD Student
Biostatistics and Bioinformatics Unit
Institute of Psychological Medicine and Clinical Neurosciences
Cardiff University
Hadyn Ellis Building
Maindy Road
Cardiff CF24 4HQ