Hi Andy,

*This is actually the second time that I sending this reply as I don't think 
that it got to the mailing list, and I cannot see it in the archives.*

Yes, I’m back to receiving the emails now. I noticed that in the previous 
thread, there was some message about the website being down? But anyway, it’s 
working now. Also, I noticed that my message with my code was repeated there, 
and in the top case it looked awful, but lower down it was fine… so I don’t 
know what was going on there. 

But back to the problem at hand… yes I’m still getting the strange cases of 
‘rogue’ high scores appearing when doing the 500 permutations, but none of 
these is being repeated when these splits are scrutinised. What I have done in 
the meantime is take your advice and use the cross_val_score function, with 50 
fold CV (leaving 155 or 156 in each test set) and this is seeming to work 
really well in the smaller subset that I have. But from another thread, I have 
found out that this is using the svm.decision_function instead of svm.predict 
to get the scores, and I'm not familiar enough with this to know if that would 
make a difference.

In fact, the only reason that I was using my method was because I really wanted 
to examine if the results were very susceptible to the splits of the data into 
training and test. As k-fold only allows each sample to be in the test set 
once, I thought this would allow for more flexibility. But I really like the 
sound of the cross_val_predict function that you mentioned, so I will install 
that version and give it a try.

As for the problem… I still don’t know why that is occurring. My only guess is 
that something is happening with the pool.map over the different CPUs on the 
cluster. I don’t do any parallel programming myself though so I don’t know how 
to delve into it. Could this be a reason for this to happen?

Tim

On 14 Jan 2015, at 02:28, Andy <t3k...@gmail.com> wrote:

> Hi Tim.
> So are you saying that even when fixing the random_state in train_test_split, 
> it is not reproducible?
> The archive code looks fine to me: 
> http://sourceforge.net/p/scikit-learn/mailman/message/33220124/
> Did you subscribe to the mailing list? Otherwise you will not get replies 
> (except this one where I explicitly copied you into the "to").
> 
> Andy
> 
> On 01/13/2015 11:08 AM, Timothy Vivian-Griffiths wrote:
>> Dear Joel,
>> 
>> Thanks for the reply here. Apologies for the later delay, but I have not 
>> been receiving the email updates, and I only noticed your reply when I 
>> looked on the archive.
>> 
>> My problem has now arisen in a smaller dataset with only 125 features (small 
>> in genetics) and 7633 samples. Again I isolated out the seeds which caused 
>> the splits that resulted in high performance, and the good score was not 
>> repeated when I ran the analysis multiple times again. Instead, I got a 
>> repeated lower score for them as I would expect… but the first time around, 
>> I got the higher score.
>> 
>> I have taken your advice and set the random_state for train_test_split, but 
>> this did not solve the problem. So I am still puzzled by this. What I have 
>> done in the meantime is to use the cross_val_score with 50 folds instead of 
>> my own permutation procedure to see if this makes any difference. But I am 
>> still really curious as to why this has arisen in the first place. I'll 
>> update on this, but it takes some time for each of these analyses to run on 
>> the cluster.
>> 
>> Just one other thing… I noticed that my code that I put into the email is 
>> displayed well in the emails but does not display very well on the archive 
>> page. Does anyone have any advice for the best way to put code into these 
>> emails?
>> 
>> Tim
>> 
>>> 
>>> Message: 3
>>> Date: Sat, 10 Jan 2015 21:25:48 +1100
>>> From: Joel Nothman <joel.noth...@gmail.com>
>>> Subject: Re: [Scikit-learn-general] Thanks for help
>>> To: scikit-learn-general <scikit-learn-general@lists.sourceforge.net>
>>> Message-ID:
>>>     <CAAkaFLVDuPvEBaVe+8Y=ucfosmvajbbfjn84wxhaz4n2dms...@mail.gmail.com>
>>> Content-Type: text/plain; charset="utf-8"
>>> 
>>> Hi Timothy,
>>> 
>>> You are not setting random_state for train_test_split. Please check if this
>>> fixes the problem.
>>> 
>>> - Joel
>>> 
>>> 
>>> Ok, well once again, thank you for your reply. I will provide some of my 
>>> code here and I hope that it helps. Just bear in mind that I have not had 
>>> any formal training in programming, so my code will very much come under 
>>> the category of 'PhDware'.
>>> 
>>> So, this was the class that I wrote that gets imported into my different 
>>> analysis scripts:
>>> 
>>> # Filename: Clozuk_Machine_Learning_saving_random_state.py
>>> 
>>> """
>>> A module that is to be imported into other Python scripts in order to run
>>> the different algorithms on the data.
>>> """
>>> 
>>> import numpy as np
>>> import numpy.random as npr
>>> from sklearn.svm import SVC
>>> from sklearn.cross_validation import train_test_split
>>> from sklearn.metrics import roc_auc_score
>>> 
>>> class DimensionError(Exception):
>>>   pass
>>> 
>>> class clSVC(SVC):
>>>   """
>>>   ccSVC(inputs, results, save_path, C=1, kernel='rbf', tp=0.5)
>>> 
>>>   Sets up a Support Vector Classifier using the SVC object from the sklearn
>>>   package.
>>> 
>>>   For detailed list of parameters and attributes - see documentation on SVC
>>>   from scikit-learn
>>> 
>>>   Additional Parameters
>>>   ---------------------
>>>   inputs : numpy array
>>>       A numpy array of all the input data to the model. Features represented
>>>       in columns and samples represented in rows. Must be a 2D array
>>>   results : numpy array
>>>       A one dimensional numpy array of the results of the data
>>>   save_path : string
>>>        The full path to where the answers for each permutation will be saved
>>>   C : float
>>>       A value representing how much importance is given to fitting the
>>>       training data versus regularisation. Higher C values mean that the 
>>> model
>>>       is fitted to the data given, but could result in over-fitting
>>>   tp : float > 0 and <1
>>>       A value representing the proportion of the data used in the training
>>>       set with the remainder going to the test set
>>>   """
>>> 
>>>   def __init__(self, inputs, results, save_path, C=1, kernel='rbf', tp=0.5):
>>>       super(clSVC, self).__init__(C=C, kernel=kernel)
>>>       if len(inputs.shape)==2:
>>>           self.inputs = inputs
>>>       else:
>>>           raise DimensionError('Inputs array must be 2D')
>>> 
>>>       if len(results.shape)==1:
>>>           self.results = results
>>>       else:
>>>           raise DimensionError('Results array must be 1D')
>>> 
>>>       self.save_path = save_path
>>> 
>>> 
>>>       if tp >0 or tp < 1:
>>>           self.tp = tp
>>>       else:
>>>           raise ValueError('tp must be between 0 and 1. %0.2f entered' % tp)
>>> 
>>> 
>>>   def run_model(self, seed=None):
>>>       if seed==None:
>>>           seed = npr.randint(10000)
>>>       npr.seed(seed)
>>> 
>>>       # Using this seed to set the random state as well
>>>       self.random_state = seed
>>> 
>>>       X_train, X_test, y_train, y_test = train_test_split(self.inputs, \
>>>                                           self.results, train_size=self.tp)
>>>       self.fit(X_train, y_train)
>>>       predictions = self.predict(X_test)
>>>       percentage_correct = (np.sum(predictions == y_test) / 
>>> float(len(predictions))) * 100
>>>       score = roc_auc_score(y_test, predictions)
>>> 
>>>       # Saving every iteration just in case the walltime runs out
>>>       fi = open('%s.txt' % self.save_path, 'a')
>>>       fi.write('%d %f %f\n' % (seed, percentage_correct, score))
>>>       fi.close()
>>>       return (seed, percentage_correct, score)
>>> 
>>> 
>>> 
>>> 
>>> And here is an example of a script that imports and runs this:
>>> 
>>> # Filename: svm_125_GWAS_mp.py
>>> 
>>> """
>>> Reads in the genotypes and the phenotypes
>>> and then carries out the whole SVC on them
>>> """
>>> 
>>> import numpy as np
>>> from ClozukMachineLearning_saving_random_state import clSVC
>>> from multiprocessing.dummy import Pool as ThreadPool
>>> from sys import argv
>>> 
>>> # Getting the parameter of C from the input and converting it to an integer
>>> # Also getting kernel
>>> script, C, kernel = argv
>>> C = int(C)
>>> 
>>> # Reading in the inputs and the targets
>>> inputs = 
>>> np.load('stored_data_2015/125_GWAS/125_GWAS_combo_LOR_weighted_nan_removed_probabilistic_imputation.npy')
>>> targets = np.load('stored_data_2015/125_GWAS/125_GWAS_combo_phenotype.npy')
>>> 
>>> # Now building the actual model
>>> svc = clSVC(inputs, targets, C=C, 
>>> save_path='svm_answers_2015/125_GWAS/500_permutations_125_GWAS_%s_svm_probabilistic_imputation_C_%d'
>>>  % (kernel, C), kernel=kernel, tp=0.75)
>>> 
>>> # Now reading in the random seeds
>>> seeds = np.load('stored_data_2015/random_seeds.npy')
>>> 
>>> # Now setting up the multiprocessing pools
>>> pool = ThreadPool(16)
>>> 
>>> # And mapping the learning function across these pools
>>> answers = pool.map(svc.run_model, seeds)
>>> 
>>> # Closing and joining the results
>>> pool.close()
>>> pool.join()
>>> 
>>> Just to recap the problem that I am having for anyone new who might be 
>>> reading this. The aim of this is to try and predict case/control status of 
>>> a schizophrenia dataset given information on genotypes for the different 
>>> subjects.  I am running this procedure 500 times for each dataset. Each 
>>> time, the data is split into training and test (75%, 25% respectively) in 
>>> different ways using the train_test_split function. Each split is 
>>> determined by random seed integers that are read in from the file 
>>> 'random_seeds.npy'. The intention was to train and test an SVC model each 
>>> time, and then get a distribution of the scores. I wanted to see what the 
>>> general performance was, as well as the effect of splitting the data into 
>>> different training/test sets. For each permutation, the seed is used to 
>>> determine the train_test_split and the random_state.
>>> 
>>> I thought that by using the same seeds this would result in the same 
>>> performance for the individual splits when carried out multiple times. 
>>> However, while this is happening for a smaller dataset with only 125 
>>> features, on a larger one with over 31,000 features, it is not. For the 
>>> majority of the splits, I am getting ROC scores of around 0.6, which is 
>>> what I would expect; but in about 15% of the splits, the scores are grouped 
>>> very highly, around 0.9. When I first saw this, I wanted to find out what 
>>> was different about those particular train/test splits was causing the high 
>>> scores, so I isolated the seeds that resulted in good performance, and ran 
>>> the procedure on those alone. Instead of seeing a repetition of the high 
>>> performance, I got a similar pattern that was happening before: most scores 
>>> were around 0.6, and about 15% were around 0.9. So the performance for the 
>>> same splits of the data, with the random_state being set the same for both 
>>> runs, are giving different results.
>>> 
>>> I decided to look at some of the seeds which resulted in high scores both 
>>> times, by running this procedure on them multiple times. Then I am getting 
>>> what I was expecting: the same answer (around 0.6) repeated for each run of 
>>> each seed.
>>> 
>>> So what is puzzling me is why any of these high scores appeared at all in 
>>> the first place?
>>> 
>>> Tim
>>> 
>>> On 7 Jan 2015, at 16:18, Andy <t3k...@gmail.com<mailto:t3k...@gmail.com>> 
>>> wrote:
>>> 
>>> Hi Tim.
>>> Please keep all discussions on the mailing list, as individual contributors 
>>> might not find the time to respond.
>>> With fixed random seeds, results should be reproducible. If you provide the 
>>> full code, it might be possible to say where the problem lies.
>>> 
>>> Sorry, I meant ShuffleSplitCV, but you can also use any other CV object.
>>> The cross_val_predict function is not super essential, just a convenience. 
>>> You should start with computing the scores using cross_val_score.
>>> 
>>> For otherwise maintained clusters: installing a version locally is super 
>>> easy. Just check out the dev version and set the pythonpath appropriately,
>>> or install into a virtual environment. Many people have custom python 
>>> libraries (or whole environments) installed.
>>> 
>>> Cheers,
>>> Andy
>>> 
>>> 
>>> On 01/07/2015 05:42 AM, Timothy Vivian-Griffiths wrote:
>>> Dear Andy,
>>> 
>>> Thank you for your reply on the scikit-learn problem I was having. Seeing 
>>> as I am new to this, I am writing to you directly; is this what I should 
>>> do, or should I reply to the response email that you gave me?
>>> 
>>> As for the reproducability, I have not set the probability to be True, so 
>>> it should be running on the default. I am also setting the random state 
>>> parameter, so I am puzzled as to what is happening. I haven't found a 
>>> single split that is reproducing high performing results. I understand that 
>>> there will be discrepancies in the data, but I don't understand why splits 
>>> should perform differently on different occasions.
>>> 
>>> Just another thing, I have noticed that the cross_val_predict function that 
>>> you mention is in the latest version of sklearn, but I cannot find the 
>>> RandomizedSplitCV one. Also, seeing as I am running my code on a cluster 
>>> which I do not maintain, I think it's probably best if I wait until 0.16 
>>> becomes the stable version before I ask the admins to update. Do you have 
>>> any idea of when this might be?
>>> 
>>> Thanks for your help and apologies if I am not supposed to contact your 
>>> email address directly,
>>> 
>>> Tim
>>> 
>>> 
>> 
>> ------------------------------------------------------------------------------
>> New Year. New Location. New Benefits. New Data Center in Ashburn, VA.
>> GigeNET is offering a free month of service with a new server in Ashburn.
>> Choose from 2 high performing configs, both with 100TB of bandwidth.
>> Higher redundancy.Lower latency.Increased capacity.Completely compliant.
>> http://p.sf.net/sfu/gigenet
>> _______________________________________________
>> Scikit-learn-general mailing list
>> Scikit-learn-general@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
> 
> 


------------------------------------------------------------------------------
New Year. New Location. New Benefits. New Data Center in Ashburn, VA.
GigeNET is offering a free month of service with a new server in Ashburn.
Choose from 2 high performing configs, both with 100TB of bandwidth.
Higher redundancy.Lower latency.Increased capacity.Completely compliant.
http://p.sf.net/sfu/gigenet
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

Reply via email to