Hi Tim. I think it is highly unlikely that splits are not reproducible, and that this is caused by parallelization. If you run ``cross_val_score`` twice with the same seed, the outcomes are exactly the same, right? How are you trying and failing to reproduce the scores?
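A minimal sketch of that check, not taken from the thread: it assumes a recent scikit-learn, where `cross_val_score` and `ShuffleSplit` live in `sklearn.model_selection` (the code quoted further down uses the older `sklearn.cross_validation` module), and it uses synthetic data from `make_classification` as a stand-in for the genotype arrays. The sample count, `C`, and number of splits are arbitrary illustration values, and `scores_for_seed` is a hypothetical helper name.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the real data (assumption: binary labels).
X, y = make_classification(n_samples=400, n_features=20, random_state=0)


def scores_for_seed(seed):
    """Score an RBF SVC on 10 random 75/25 splits fixed by `seed`."""
    cv = ShuffleSplit(n_splits=10, test_size=0.25, random_state=seed)
    model = SVC(C=1.0, kernel="rbf")
    return cross_val_score(model, X, y, cv=cv, scoring="roc_auc")


first = scores_for_seed(42)
second = scores_for_seed(42)

# Same seed, same splits, so the two score arrays should match exactly.
print(np.array_equal(first, second))
```

If the two arrays ever differed here, the splitter or the estimator would be the place to look; if they match, the difference has to come from how the scores are produced elsewhere.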
Best,
Andy

On 01/19/2015 11:21 AM, Timothy Vivian-Griffiths wrote:
> Hi Andy,
>
> *This is actually the second time that I am sending this reply, as I don't
> think that it got to the mailing list, and I cannot see it in the archives.*
>
> Yes, I’m back to receiving the emails now. I noticed that in the previous
> thread, there was some message about the website being down? But anyway, it’s
> working now. Also, I noticed that my message with my code was repeated there,
> and in the top case it looked awful, but lower down it was fine… so I don’t
> know what was going on there.
>
> But back to the problem at hand… yes I’m still getting the strange cases of
> ‘rogue’ high scores appearing when doing the 500 permutations, but none of
> these is being repeated when these splits are scrutinised. What I have done
> in the meantime is take your advice and use the cross_val_score function,
> with 50 fold CV (leaving 155 or 156 in each test set) and this is seeming to
> work really well in the smaller subset that I have. But from another thread,
> I have found out that this is using the svm.decision_function instead of
> svm.predict to get the scores, and I'm not familiar enough with this to know
> if that would make a difference.
>
> In fact, the only reason that I was using my method was because I really
> wanted to examine if the results were very susceptible to the splits of the
> data into training and test. As k-fold only allows each sample to be in the
> test set once, I thought this would allow for more flexibility. But I really
> like the sound of the cross_val_predict function that you mentioned, so I
> will install that version and give it a try.
>
> As for the problem… I still don’t know why that is occurring. My only guess
> is that something is happening with the pool.map over the different CPUs on
> the cluster. I don’t do any parallel programming myself though so I don’t
> know how to delve into it. Could this be a reason for this to happen?
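One way to test the `pool.map` guess, sketched under assumptions rather than taken from the thread: give every task its own estimator and pass the seed straight to `train_test_split` through `random_state`, so that no worker relies on the global NumPy seed or on a shared model object, then compare the threaded results with a plain serial loop. The sketch assumes a recent scikit-learn (`sklearn.model_selection`), synthetic data in place of the genotypes, and a hypothetical helper named `run_split`.

```python
from multiprocessing.dummy import Pool as ThreadPool

from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the real inputs and targets (assumption).
X, y = make_classification(n_samples=300, n_features=20, random_state=0)


def run_split(seed):
    """Hypothetical helper: one 75/25 split, one fresh model, one score."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.75, random_state=int(seed)
    )
    model = SVC(C=1.0, kernel="rbf")  # new object per task, nothing shared
    model.fit(X_train, y_train)
    # Score with continuous decision values rather than hard predictions.
    return seed, roc_auc_score(y_test, model.decision_function(X_test))


seeds = list(range(20))

# Reference run: one seed after another in a single thread.
serial = [run_split(s) for s in seeds]

# Same work spread over a thread pool, as in the quoted script.
pool = ThreadPool(4)
threaded = pool.map(run_split, seeds)
pool.close()
pool.join()

# pool.map preserves input order, so the lists are directly comparable.
print(serial == threaded)
```

If the two lists match, the parallel map by itself is not changing the scores; a mismatch would instead point at state shared between workers.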
>
> Tim
>
> On 14 Jan 2015, at 02:28, Andy <t3k...@gmail.com> wrote:
>
>> Hi Tim.
>> So are you saying that even when fixing the random_state in
>> train_test_split, it is not reproducible?
>> The archive code looks fine to me:
>> http://sourceforge.net/p/scikit-learn/mailman/message/33220124/
>> Did you subscribe to the mailing list? Otherwise you will not get replies
>> (except this one where I explicitly copied you into the "to").
>>
>> Andy
>>
>> On 01/13/2015 11:08 AM, Timothy Vivian-Griffiths wrote:
>>> Dear Joel,
>>>
>>> Thanks for the reply here. Apologies for the later delay, but I have not
>>> been receiving the email updates, and I only noticed your reply when I
>>> looked on the archive.
>>>
>>> My problem has now arisen in a smaller dataset with only 125 features
>>> (small in genetics) and 7633 samples. Again I isolated out the seeds which
>>> caused the splits that resulted in high performance, and the good score was
>>> not repeated when I ran the analysis multiple times again. Instead, I got a
>>> repeated lower score for them as I would expect… but the first time around,
>>> I got the higher score.
>>>
>>> I have taken your advice and set the random_state for train_test_split, but
>>> this did not solve the problem. So I am still puzzled by this. What I have
>>> done in the meantime is to use the cross_val_score with 50 folds instead of
>>> my own permutation procedure to see if this makes any difference. But I am
>>> still really curious as to why this has arisen in the first place. I'll
>>> update on this, but it takes some time for each of these analyses to run on
>>> the cluster.
>>>
>>> Just one other thing… I noticed that my code that I put into the email is
>>> displayed well in the emails but does not display very well on the archive
>>> page. Does anyone have any advice for the best way to put code into these
>>> emails?
>>>
>>> Tim
>>>
>>>> Message: 3
>>>> Date: Sat, 10 Jan 2015 21:25:48 +1100
>>>> From: Joel Nothman <joel.noth...@gmail.com>
>>>> Subject: Re: [Scikit-learn-general] Thanks for help
>>>> To: scikit-learn-general <scikit-learn-general@lists.sourceforge.net>
>>>> Message-ID:
>>>> <CAAkaFLVDuPvEBaVe+8Y=ucfosmvajbbfjn84wxhaz4n2dms...@mail.gmail.com>
>>>> Content-Type: text/plain; charset="utf-8"
>>>>
>>>> Hi Timothy,
>>>>
>>>> You are not setting random_state for train_test_split. Please check if this
>>>> fixes the problem.
>>>>
>>>> - Joel
>>>>
>>>>
>>>> Ok, well once again, thank you for your reply. I will provide some of my
>>>> code here and I hope that it helps. Just bear in mind that I have not had
>>>> any formal training in programming, so my code will very much come under
>>>> the category of 'PhDware'.
>>>>
>>>> So, this was the class that I wrote that gets imported into my different
>>>> analysis scripts:
>>>>
>>>> # Filename: Clozuk_Machine_Learning_saving_random_state.py
>>>>
>>>> """
>>>> A module that is to be imported into other Python scripts in order to run
>>>> the different algorithms on the data.
>>>> """
>>>>
>>>> import numpy as np
>>>> import numpy.random as npr
>>>> from sklearn.svm import SVC
>>>> from sklearn.cross_validation import train_test_split
>>>> from sklearn.metrics import roc_auc_score
>>>>
>>>> class DimensionError(Exception):
>>>>     pass
>>>>
>>>> class clSVC(SVC):
>>>>     """
>>>>     clSVC(inputs, results, save_path, C=1, kernel='rbf', tp=0.5)
>>>>
>>>>     Sets up a Support Vector Classifier using the SVC object from the
>>>>     sklearn package.
>>>>
>>>>     For detailed list of parameters and attributes - see documentation on
>>>>     SVC from scikit-learn
>>>>
>>>>     Additional Parameters
>>>>     ---------------------
>>>>     inputs : numpy array
>>>>         A numpy array of all the input data to the model. Features
>>>>         represented in columns and samples represented in rows.
>>>>         Must be a 2D array
>>>>     results : numpy array
>>>>         A one dimensional numpy array of the results of the data
>>>>     save_path : string
>>>>         The full path to where the answers for each permutation will be
>>>>         saved
>>>>     C : float
>>>>         A value representing how much importance is given to fitting the
>>>>         training data versus regularisation. Higher C values mean that the
>>>>         model is fitted to the data given, but could result in over-fitting
>>>>     tp : float > 0 and < 1
>>>>         A value representing the proportion of the data used in the training
>>>>         set with the remainder going to the test set
>>>>     """
>>>>
>>>>     def __init__(self, inputs, results, save_path, C=1, kernel='rbf',
>>>>                  tp=0.5):
>>>>         super(clSVC, self).__init__(C=C, kernel=kernel)
>>>>         if len(inputs.shape) == 2:
>>>>             self.inputs = inputs
>>>>         else:
>>>>             raise DimensionError('Inputs array must be 2D')
>>>>
>>>>         if len(results.shape) == 1:
>>>>             self.results = results
>>>>         else:
>>>>             raise DimensionError('Results array must be 1D')
>>>>
>>>>         self.save_path = save_path
>>>>
>>>>         # tp must lie strictly between 0 and 1
>>>>         if 0 < tp < 1:
>>>>             self.tp = tp
>>>>         else:
>>>>             raise ValueError('tp must be between 0 and 1. %0.2f entered'
>>>>                              % tp)
>>>>
>>>>     def run_model(self, seed=None):
>>>>         if seed is None:
>>>>             seed = npr.randint(10000)
>>>>         npr.seed(seed)
>>>>
>>>>         # Using this seed to set the random state as well
>>>>         self.random_state = seed
>>>>
>>>>         X_train, X_test, y_train, y_test = train_test_split(
>>>>             self.inputs, self.results, train_size=self.tp)
>>>>         self.fit(X_train, y_train)
>>>>         predictions = self.predict(X_test)
>>>>         percentage_correct = (np.sum(predictions == y_test) /
>>>>                               float(len(predictions))) * 100
>>>>         score = roc_auc_score(y_test, predictions)
>>>>
>>>>         # Saving every iteration just in case the walltime runs out
>>>>         fi = open('%s.txt' % self.save_path, 'a')
>>>>         fi.write('%d %f %f\n' % (seed, percentage_correct, score))
>>>>         fi.close()
>>>>         return (seed, percentage_correct, score)
>>>>
>>>>
>>>> And here is an example of a script that imports and runs this:
>>>>
>>>> # Filename: svm_125_GWAS_mp.py
>>>>
>>>> """
>>>> Reads in the genotypes and the phenotypes
>>>> and then carries out the whole SVC on them
>>>> """
>>>>
>>>> import numpy as np
>>>> from ClozukMachineLearning_saving_random_state import clSVC
>>>> from multiprocessing.dummy import Pool as ThreadPool
>>>> from sys import argv
>>>>
>>>> # Getting the parameter of C from the input and converting it to an integer
>>>> # Also getting kernel
>>>> script, C, kernel = argv
>>>> C = int(C)
>>>>
>>>> # Reading in the inputs and the targets
>>>> inputs = np.load('stored_data_2015/125_GWAS/125_GWAS_combo_LOR_weighted_nan_removed_probabilistic_imputation.npy')
>>>> targets = np.load('stored_data_2015/125_GWAS/125_GWAS_combo_phenotype.npy')
>>>>
>>>> # Now building the actual model
>>>> svc = clSVC(inputs, targets, C=C,
>>>>             save_path='svm_answers_2015/125_GWAS/500_permutations_125_GWAS_%s_svm_probabilistic_imputation_C_%d'
>>>>             % (kernel, C), kernel=kernel, tp=0.75)
>>>>
>>>> # Now reading in the random seeds
>>>> seeds = np.load('stored_data_2015/random_seeds.npy')
>>>>
>>>> # Now setting up the multiprocessing pools
>>>> pool = ThreadPool(16)
>>>>
>>>> # And mapping the learning function across these pools
>>>> answers = pool.map(svc.run_model, seeds)
>>>>
>>>> # Closing and joining the results
>>>> pool.close()
>>>> pool.join()
>>>>
>>>> Just to recap the problem that I am having for anyone new who might be
>>>> reading this. The aim of this is to try and predict case/control status of
>>>> a schizophrenia dataset given information on genotypes for the different
>>>> subjects. I am running this procedure 500 times for each dataset. Each
>>>> time, the data is split into training and test (75%, 25% respectively) in
>>>> different ways using the train_test_split function. Each split is
>>>> determined by random seed integers that are read in from the file
>>>> 'random_seeds.npy'. The intention was to train and test an SVC model each
>>>> time, and then get a distribution of the scores. I wanted to see what the
>>>> general performance was, as well as the effect of splitting the data into
>>>> different training/test sets. For each permutation, the seed is used to
>>>> determine the train_test_split and the random_state.
>>>>
>>>> I thought that by using the same seeds this would result in the same
>>>> performance for the individual splits when carried out multiple times.
>>>> However, while this is happening for a smaller dataset with only 125
>>>> features, on a larger one with over 31,000 features, it is not. For the
>>>> majority of the splits, I am getting ROC scores of around 0.6, which is
>>>> what I would expect; but in about 15% of the splits, the scores are
>>>> grouped very highly, around 0.9. When I first saw this, I wanted to find
>>>> out what it was about those particular train/test splits that was
>>>> causing the high scores, so I isolated the seeds that resulted in good
>>>> performance, and ran the procedure on those alone.
>>>> Instead of seeing a repetition of the high performance, I got a pattern
>>>> similar to the one before: most scores were around 0.6, and about 15% were
>>>> around 0.9. So the performance for the same splits of the data, with the
>>>> random_state being set the same for both runs, is giving different
>>>> results.
>>>>
>>>> I decided to look at some of the seeds which resulted in high scores both
>>>> times, by running this procedure on them multiple times. Then I am getting
>>>> what I was expecting: the same answer (around 0.6) repeated for each run
>>>> of each seed.
>>>>
>>>> So what is puzzling me is why any of these high scores appeared at all in
>>>> the first place?
>>>>
>>>> Tim
>>>>
>>>> On 7 Jan 2015, at 16:18, Andy <t3k...@gmail.com> wrote:
>>>>
>>>> Hi Tim.
>>>> Please keep all discussions on the mailing list, as individual
>>>> contributors might not find the time to respond.
>>>> With fixed random seeds, results should be reproducible. If you provide
>>>> the full code, it might be possible to say where the problem lies.
>>>>
>>>> Sorry, I meant ShuffleSplit, but you can also use any other CV object.
>>>> The cross_val_predict function is not super essential, just a convenience.
>>>> You should start with computing the scores using cross_val_score.
>>>>
>>>> For otherwise maintained clusters: installing a version locally is super
>>>> easy. Just check out the dev version and set the pythonpath appropriately,
>>>> or install into a virtual environment. Many people have custom python
>>>> libraries (or whole environments) installed.
>>>>
>>>> Cheers,
>>>> Andy
>>>>
>>>>
>>>> On 01/07/2015 05:42 AM, Timothy Vivian-Griffiths wrote:
>>>> Dear Andy,
>>>>
>>>> Thank you for your reply on the scikit-learn problem I was having. Seeing
>>>> as I am new to this, I am writing to you directly; is this what I should
>>>> do, or should I reply to the response email that you gave me?
>>>>
>>>> As for the reproducibility, I have not set the probability to be True, so
>>>> it should be running on the default. I am also setting the random state
>>>> parameter, so I am puzzled as to what is happening. I haven't found a
>>>> single split that is reproducing high performing results. I understand
>>>> that there will be discrepancies in the data, but I don't understand why
>>>> splits should perform differently on different occasions.
>>>>
>>>> Just another thing, I have noticed that the cross_val_predict function
>>>> that you mention is in the latest version of sklearn, but I cannot find
>>>> the RandomizedSplitCV one. Also, seeing as I am running my code on a
>>>> cluster which I do not maintain, I think it's probably best if I wait
>>>> until 0.16 becomes the stable version before I ask the admins to update.
>>>> Do you have any idea of when this might be?
>>>>
>>>> Thanks for your help and apologies if I am not supposed to contact your
>>>> email address directly,
>>>>
>>>> Tim
>>>>
>>>>
>>> ------------------------------------------------------------------------------
>>> New Year. New Location. New Benefits. New Data Center in Ashburn, VA.
>>> GigeNET is offering a free month of service with a new server in Ashburn.
>>> Choose from 2 high performing configs, both with 100TB of bandwidth.
>>> Higher redundancy.Lower latency.Increased capacity.Completely compliant.
>>> http://p.sf.net/sfu/gigenet
>>> _______________________________________________
>>> Scikit-learn-general mailing list
>>> Scikit-learn-general@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general