Hi Timothy.
Without seeing the actual code, it is hard to guess what is happening. Often there is some slight error in the processing that results in strange outcomes.

First, you could do the splitting and scoring with the cross_val_score function together with a ShuffleSplit cross-validation object, which needs only a single seed (see the sketch below).
That also gives you parallelization simply by setting n_jobs.
Very large discrepancies between randomized splits usually mean that something funny is going on in your data, for example that some samples are highly correlated while others are not.
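
A minimal sketch of what I mean (assuming your data are in X and y; C, kernel and n_jobs values are just placeholders):

from sklearn.cross_validation import ShuffleSplit, cross_val_score
from sklearn.svm import SVC

# 500 random 75/25 splits, all controlled by one seed
cv = ShuffleSplit(len(y), n_iter=500, test_size=0.25, random_state=0)
clf = SVC(kernel='linear', C=1, random_state=0)

# one AUC per split, computed in parallel
scores = cross_val_score(clf, X, y, scoring='roc_auc', cv=cv, n_jobs=16)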

As for the non-reproducibility: could it be that you are using probability=True in the SVC? If so, you also need to set the random_state of the SVC, because the internal cross-validation used for the probability calibration is randomized. Using probability=True is usually not something you want anyway. LinearSVC, on the other hand, always needs a random_state to be reproducible.
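
For example (hypothetical parameter values, just to show where the seeds go):

from sklearn.svm import SVC, LinearSVC

# probability=True runs a randomized internal cross-validation for Platt
# scaling, so the estimator needs a seed to be reproducible
clf = SVC(kernel='linear', C=1, probability=True, random_state=0)

# LinearSVC uses a randomized solver, so it needs a seed as well
lin_clf = LinearSVC(C=1, random_state=0)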

You should solve the reproducibility issue first. If you still see this large discrepancy with reproducible results, you could look at the points that are classified correctly versus those that are not, and see whether you can find a pattern; cross_val_predict can be useful for that.
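
Something along these lines (a sketch; I use a plain 5-fold CV here because cross_val_predict needs non-overlapping test sets, and the function is only available in very recent versions):

import numpy as np
from sklearn.cross_validation import cross_val_predict

# out-of-fold predictions for every sample (clf, X, y as before)
pred = cross_val_predict(clf, X, y, cv=5)
misclassified = np.where(pred != y)[0]  # indices worth inspecting for a pattern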

Hth,
Andy


On 01/06/2015 06:10 AM, Timothy Vivian-Griffiths wrote:
I am getting some confusing results when carrying out a permutation exercise on my data using the SVC class from the sklearn.svm module. The dataset I am using is quite large and very high-dimensional, but I will try to describe it briefly.

The dataset represents risk scores for genetic polymorphisms in a case/control schizophrenia study. For each sample there are 31166 predictors and a binary outcome variable. All the predictors have a similar level of variance, so I did not deem it necessary to carry out any feature scaling.

There are a total of 7763 samples in the training dataset I am using to build the model. I was not expecting a very accurate model, as there have been many attempts to build predictive models of schizophrenia status without much success. This could be due to the high degree of heterogeneity in the disorder, meaning that it could actually be a collection of different disorders, each with a different underlying genetic etiology.

For this reason, I decided to carry out a permutation procedure 500 times, selecting 75% of the samples to train the model and keeping the remaining 25% as the cross-validation set. I then scored this set with roc_auc_score from sklearn.metrics and stored all of the results. I performed the split using the train_test_split function from sklearn.cross_validation, and in order to make the procedure reproducible, I set the seeds for the splits manually at each permutation. The 500 random seeds were created in advance and stored as a .npy array file which could be loaded for every run of the procedure. These seeds were also used to set the random_state of each model when it was built. The intention was to get a distribution of scores, to see how the model performed in general and how much variance resulted from using different train/test splits.
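
In outline, each permutation does roughly the following (a simplified sketch rather than my exact code; the file name is a placeholder, and I have written the scoring here using the decision function values):

import numpy as np
from sklearn.cross_validation import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

seeds = np.load('permutation_seeds.npy')  # 500 pre-generated seeds

def run_one_permutation(seed):
    # 75/25 split and model both seeded with the same value
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed)
    clf = SVC(kernel='linear', C=1, random_state=seed)
    clf.fit(X_train, y_train)
    return roc_auc_score(y_test, clf.decision_function(X_test).ravel())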

I ran this procedure using the linear kernel (and later the RBF kernel, with similar results), trying C values of 1 and 1000 in order to cover two extremes (in fact the results did not change much between them). However, the results were quite surprising. The majority of the splits performed relatively poorly, as I expected, with an AUC of around 0.59-0.6, but about 15% of the splits performed very well (~0.9). I saved the seeds of the high-performing splits in order to try to reproduce the results.

It is my understanding that, given the same training and test data and the same random_state, the SVC algorithm should reproduce the same model, but this is not happening. When I reran the permutation procedure on the previously high-performing seeds, most of them reverted to roughly the same poor performance as the others, with, again, about 15% of this smaller set performing well. I then decided to isolate some of the seeds that had performed well both times. I ran the procedure multiple times using one of these seeds repeatedly, and I got what I expected: the exact same low score (~0.6) on every iteration. I did not succeed in finding a seed that reproduced a previous high performance.

So what is confusing me is this: why did these high-performing scores show up in the first place? When isolating individual train/test splits I get what I expect, namely exactly the same performance score every time, but this does not happen when running the full permutation procedure. If anyone can offer any insight I would be very grateful.

I have tried to provide as much detail on the data as possible without writing too much, but I can give additional information if it would be helpful. I should also add that I was running these programs on a cluster, in parallel over 16 worker processes, using the Pool and map functions from the multiprocessing module in core Python.
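
The parallel dispatch is essentially this (again a sketch, reusing the run_one_permutation function from above):

from multiprocessing import Pool

pool = Pool(processes=16)
scores = pool.map(run_one_permutation, seeds)  # one AUC per seed
pool.close()
pool.join()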

Many thanks,

Tim
-------------------------
Tim Vivian-Griffiths
Wellcome Trust PhD Student
Biostatistics and Bioinformatics Unit
Institute of Psychological Medicine and Clinical Neurosciences
Cardiff University
Hadyn Ellis Building
Maindy Road
Cardiff CF24 4HQ


