Hi Timothy.
Without seeing the actual code, it is hard to guess what is happening.
Often there is some slight error in the processing that results in
strange outcomes.
First, you could perform the splitting and scoring with the
cross_val_score function and a ShuffleSplit cross-validation object
with just a single seed.
That also gives you parallelization simply by setting n_jobs.
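Something along these lines (an untested sketch; I'm assuming your
data is in X and y, and using the old-style cross_validation module):

from sklearn.svm import SVC
from sklearn.cross_validation import ShuffleSplit, cross_val_score

# one ShuffleSplit with a single random_state replaces the 500
# manually seeded train_test_split calls
cv = ShuffleSplit(len(y), n_iter=500, test_size=0.25, random_state=0)
clf = SVC(kernel='linear', C=1, random_state=0)
scores = cross_val_score(clf, X, y, scoring='roc_auc', cv=cv, n_jobs=16)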
Very high discrepancies between randomized folds probably mean that
something funny is going on with your data, and that some parts are
highly correlated, while others are not.
To explain the non-reproducibility, could it be that you are using
"probability=True" in the SVC? If this is the case, you also need to set
the random_state of SVC.
Using probability=True is usually not something you would want to do
anyway. LinearSVC always needs a random_state to be reproducible.
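Concretely, something like this (just a sketch; the point is only the
random_state arguments):

from sklearn.svm import SVC, LinearSVC

# if you keep probability=True, fix the RNG used for its internal
# cross-validation as well
clf = SVC(kernel='linear', C=1, probability=True, random_state=0)

# liblinear also uses a random number generator internally
lin = LinearSVC(C=1, random_state=0)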
You should solve the reproducibility issue first. If you still see this
large discrepancy with reproducible results, you could try to look at
the points that are correctly classified
vs those that are not and see if you can find a pattern.
cross_val_predict can be useful for that.
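For example (again a rough sketch, reusing clf, X and y from above):

import numpy as np
from sklearn.cross_validation import cross_val_predict

# out-of-fold predictions for every sample
y_pred = cross_val_predict(clf, X, y, cv=10, n_jobs=16)

# indices of the samples that are misclassified out of fold
misclassified = np.where(y_pred != y)[0]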
Hth,
Andy
On 01/06/2015 06:10 AM, Timothy Vivian-Griffiths wrote:
I am getting some confusing results when carrying out a permutation
exercise on my data using the SVC class from the sklearn.svm module.
The data I am using is quite large and has very high dimensionality,
but I will try to explain it briefly.
The dataset represents risk scores for genetic polymorphisms in a
case/control schizophrenia study. For each sample, there are 31166
predictors and a binary outcome variable. All the predictors have a
similar level of variance, so I did not deem it necessary to carry out
any feature scaling.
There are 7763 samples in total in the training data set which I am
using to build the model. I was not expecting to get a very accurate
model, as there have been many attempts to build predictive models of
schizophrenia status without much success. This could be due to the
high degree of heterogeneity in the disorder, meaning that it could
actually be a collection of different disorders, each with a different
underlying genetic etiology.
For this reason, I decided to carry out a permutation procedure 500
times, selecting 75% of the samples to train the model and keeping the
remaining 25% as the cross-validation set. I then scored this set with
roc_auc_score from sklearn.metrics and stored all of the results. I
performed the split using the train_test_split function from
sklearn.cross_validation; and in order to allow reproducibility of the
procedure, I set the seeds manually for the splits at each permutation.
The 500 random seeds were created in advance and stored as a .npy array
file which could be loaded for every run of the procedure. These seeds
were also used to set the 'random_state' for each model when it was
built. The intention was to get a distribution of scores to see how the
model performed in general, and the variance of the distribution that
resulted from using different train/test splits.
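In outline, a single permutation looks something like this
(simplified, with placeholder names; X is the predictor matrix and y
the binary outcome):

import numpy as np
from sklearn.cross_validation import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

seeds = np.load('seeds.npy')  # placeholder name for the seed file

def run_one(seed):
    # 75/25 split, seeded for reproducibility
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed)
    # the same seed also sets the model's random_state
    clf = SVC(kernel='linear', C=1, random_state=seed)
    clf.fit(X_train, y_train)
    # score the held-out 25% with the ROC AUC
    return roc_auc_score(y_test, clf.decision_function(X_test))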
I ran this procedure using the linear kernel (and later the RBF
kernel, with similar results), trying C values of 1 and 1000 in order
to get two extremes (although in fact the results did not change much
between them). However, the results were quite surprising. What I found
was that the majority of the splits performed relatively poorly (as I
expected), at around 0.59-0.6 ROC AUC, but about 15% of the splits
performed very highly (~0.9). I saved the seeds that performed highly
in order to try to reproduce the results.
It is my understanding that, given the same training and test data
with the same random_state, the SVC algorithm should reproduce the
same model, but this is not happening. When I ran the permutation
procedure on the previously high-performing seeds, most of them
reverted to the same poor performance as the others, with again about
15% of this smaller set giving high performance. I then decided to
isolate some of the seeds that had performed well both times. I ran
the procedure multiple times using one of these seeds repeatedly, and
I got what I expected: the exact same low score (~0.6) on every
iteration. I did not succeed in finding a seed that reproduced a
previous high performance.
So what is confusing me is: why did these high-performing scores show
up at all in the first place? When isolating individual train/test
splits, I am getting what I expected - a repetition of the exact same
performance score. However, this is not happening when running the
full permutation procedure. If anyone can offer any insights, I would
be very grateful.
I have tried to provide as much detail on the data as possible without
writing too much, but I can provide additional information if it would
be helpful. I should also add that I was running these programs on a
cluster, in parallel over 16 worker processes, using the Pool class
and its map method from the multiprocessing module in core Python.
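The parallel part is roughly (simplified):

from multiprocessing import Pool

if __name__ == '__main__':
    pool = Pool(16)                    # 16 worker processes
    scores = pool.map(run_one, seeds)  # run_one as outlined above
    pool.close()
    pool.join()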
Many thanks,
Tim
-------------------------
Tim Vivian-Griffiths
Wellcome Trust PhD Student
Biostatistics and Bioinformatics Unit
Institute of Psychological Medicine and Clinical Neurosciences
Cardiff University
Hadyn Ellis Building
Maindy Road
Cardiff CF24 4HQ