Hi Timothy.
Without seeing the actual code, it is hard to guess what is happening.
Often there is some slight error in the processing that results in
strange outcomes.
First, you could perform the splitting and scoring with the
cross_val_score function and a ShuffleSplit cross-validation object
with just a single seed.
That also gives you parallelization simply by setting n_jobs.
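Something along these lines (an untested sketch; I'm assuming your
data is in X and y, and using the old-style cross_validation module):

from sklearn.svm import SVC
from sklearn.cross_validation import ShuffleSplit, cross_val_score

# one ShuffleSplit with a single random_state replaces the 500
# manually seeded train_test_split calls
cv = ShuffleSplit(len(y), n_iter=500, test_size=0.25, random_state=0)
clf = SVC(kernel='linear', C=1, random_state=0)
scores = cross_val_score(clf, X, y, scoring='roc_auc', cv=cv, n_jobs=16)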
Very high discrepancies between randomized folds probably mean that
something funny is going on with your data, and that some parts are
highly correlated, while others are not.
To explain the non-reproducibility, could it be that you are using
"probability=True" in the SVC? If this is the case, you also need to set
the random_state of SVC.
Using probability=True is usually not something you would want to do
anyway. LinearSVC always needs a random_state to be reproducible.
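Concretely, something like this (just a sketch; the point is only the
random_state arguments):

from sklearn.svm import SVC, LinearSVC

# if you keep probability=True, fix the RNG used for its internal
# cross-validation as well
clf = SVC(kernel='linear', C=1, probability=True, random_state=0)

# liblinear also uses a random number generator internally
lin = LinearSVC(C=1, random_state=0)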
You should solve the reproducibility issue first. If you still see this
large discrepancy with reproducible results, you could try to look at
the points that are correctly classified
vs those that are not and see if you can find a pattern.
cross_val_predict can be useful for that.
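For example (again a rough sketch, reusing clf, X and y from above):

import numpy as np
from sklearn.cross_validation import cross_val_predict

# out-of-fold predictions for every sample
y_pred = cross_val_predict(clf, X, y, cv=10, n_jobs=16)

# indices of the samples that are misclassified out of fold
misclassified = np.where(y_pred != y)[0]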
Hth,
Andy
On 01/06/2015 06:10 AM, Timothy Vivian-Griffiths wrote:
I am getting some confusing results when carrying out a permutation
exercise on my data using the SVC class from the sklearn.svm module.
The data I am using is quite large and has very high dimensionality,
but I will try to explain it briefly.
The dataset represents risk scores for genetic polymorphisms in a
case/control schizophrenia study. For each sample, there are 31166
predictors and a binary outcome variable. All the predictors have a
similar level of variance, so I did not deem it necessary to carry out
any feature scaling.
There are 7763 samples in total in the training data set which I am
using to build the model. I was not expecting to get a very accurate
model, as there have been many attempts to build predictive models of
schizophrenia status without much success. This could be due to the
high degree of heterogeneity in the disorder, meaning that it could
actually be a collection of different disorders, each with a different
underlying genetic etiology.
For this reason, I decided to carry out a permutation procedure 500
times, selecting 75% of the samples to train the model and keeping the
remaining 25% as the cross-validation set. I then scored this set with
roc_auc_score from sklearn.metrics and stored all of the results. I
performed the split using the train_test_split function from
sklearn.cross_validation; and in order to allow reproducibility of the
procedure, I set the seeds manually for the splits at each permutation.
The 500 random seeds were created in advance and stored as a .npy array
file which could be loaded for every run of the procedure. These seeds
were also used to set the 'random_state' for each model when it was
built. The intention was to get a distribution of scores to see how the
model performed in general, and the variance of the distribution that
resulted from using different train/test splits.
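In outline, a single permutation looks something like this
(simplified, with placeholder names; X is the predictor matrix and y
the binary outcome):

import numpy as np
from sklearn.cross_validation import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

seeds = np.load('seeds.npy')  # placeholder name for the seed file

def run_one(seed):
    # 75/25 split, seeded for reproducibility
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed)
    # the same seed also sets the model's random_state
    clf = SVC(kernel='linear', C=1, random_state=seed)
    clf.fit(X_train, y_train)
    # score the held-out 25% with the ROC AUC
    return roc_auc_score(y_test, clf.decision_function(X_test))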
I ran this procedure using the linear kernel (and later the RBF
kernel, with similar results), trying C values of 1 and 1000 in order
to get two extremes (although in fact the results did not change much
between them). However, the results were quite surprising. What I found
was that the majority of the splits performed relatively poorly (as I
expected), at around 0.59-0.6 ROC AUC, but about 15% of the splits
performed very highly (~0.9). I saved the seeds that performed highly
in order to try to reproduce the results.
It is my understanding that, given the same training and test data
with the same random_state, the SVC algorithm should reproduce the
same model, but this is not happening. When I ran the permutation
procedure on the previously high-performing seeds, most of them
reverted to the same poor performance as the others, with again about
15% of this smaller set giving high performance. I then decided to
isolate some of the seeds that had performed well both times. I ran
the procedure multiple times using one of these seeds repeatedly, and
I got what I expected: the exact same low score (~0.6) on every
iteration. I did not succeed in finding a seed that reproduced a
previous high performance.
So what is confusing me is: why did these high-performing scores show
up at all in the first place? When isolating individual train/test
splits, I am getting what I expected - a repetition of the exact same
performance score. However, this is not happening when running the
full permutation procedure. If anyone can offer any insights, I would
be very grateful.
I have tried to provide as much detail on the data as possible without
writing too much, but I can provide additional information if it would
be helpful. I should also add that I was running these programs on a
cluster, in parallel over 16 worker processes, using the Pool class
and its map method from the multiprocessing module in core Python.
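The parallel part is roughly (simplified):

from multiprocessing import Pool

if __name__ == '__main__':
    pool = Pool(16)                    # 16 worker processes
    scores = pool.map(run_one, seeds)  # run_one as outlined above
    pool.close()
    pool.join()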
Many thanks,
Tim
-------------------------
Tim Vivian-Griffiths
Wellcome Trust PhD Student
Biostatistics and Bioinformatics Unit
Institute of Psychological Medicine and Clinical Neurosciences
Cardiff University
Hadyn Ellis Building
Maindy Road
Cardiff CF24 4HQ