Hi Tim.

I think it is highly unlikely that splits are not reproducible, and that 
this is caused by parallelization.
If you run ``cross_val_score`` twice with the same seed, the outcomes 
are exactly the same, right?
How are you trying and failing to reproduce the scores?
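
A quick way to check this is a minimal sketch like the one below. It assumes a recent scikit-learn, where the CV helpers live in ``sklearn.model_selection`` (the code quoted in this thread imports from the older ``sklearn.cross_validation``). The data is synthetic, generated with ``make_classification`` as a stand-in for the real arrays, and the helper name ``seeded_scores`` is made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data; the real genotype arrays are not part of this thread.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)


def seeded_scores(seed):
    """Score an RBF SVC with 5-fold CV whose shuffling is fixed by `seed`."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    return cross_val_score(SVC(C=1, kernel="rbf"), X, y, cv=cv, scoring="roc_auc")


scores_a = seeded_scores(42)
scores_b = seeded_scores(42)

# Compare the two runs fold by fold.
print(np.array_equal(scores_a, scores_b))
```

If the two arrays also match on the cluster, the CV splits themselves are reproducible and the difference has to come from somewhere else in the pipeline.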

Best,
Andy


On 01/19/2015 11:21 AM, Timothy Vivian-Griffiths wrote:
> Hi Andy,
>
> *This is actually the second time that I am sending this reply, as I don't 
> think that it got to the mailing list, and I cannot see it in the archives.*
>
> Yes, I’m back to receiving the emails now. I noticed that in the previous 
> thread, there was some message about the website being down? But anyway, it’s 
> working now. Also, I noticed that my message with my code was repeated there, 
> and in the top case it looked awful, but lower down it was fine… so I don’t 
> know what was going on there.
>
> But back to the problem at hand… yes I’m still getting the strange cases of 
> ‘rogue’ high scores appearing when doing the 500 permutations, but none of 
> these is being repeated when these splits are scrutinised. What I have done 
> in the meantime is take your advice and use the cross_val_score function, 
> with 50-fold CV (leaving 155 or 156 in each test set), and this seems to 
> work really well in the smaller subset that I have. But from another thread, 
> I have found out that this is using the svm.decision_function instead of 
> svm.predict to get the scores, and I'm not familiar enough with this to know 
> if that would make a difference.
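
On the decision_function versus predict question, a small sketch may help. It assumes a recent scikit-learn (``sklearn.model_selection``) and synthetic data from ``make_classification``; none of the numbers it prints relate to the real dataset. With hard labels there is only one operating point on the ROC curve, so the AUC reduces to the mean of sensitivity and specificity. With decision values the AUC reflects how well all test samples are ranked.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data with some label noise.
X, y = make_classification(n_samples=400, n_features=20, flip_y=0.2,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.75, random_state=0
)

clf = SVC(C=1, kernel="rbf").fit(X_train, y_train)

hard_labels = clf.predict(X_test)                # at most two distinct values
decision_values = clf.decision_function(X_test)  # one real-valued score per sample

auc_from_labels = roc_auc_score(y_test, hard_labels)
auc_from_decision = roc_auc_score(y_test, decision_values)
print(auc_from_labels, auc_from_decision)
```

The two numbers are generally not equal, so scores computed with ``predict`` and scores reported by ``cross_val_score`` with ``scoring="roc_auc"`` should not be compared directly.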
>
> In fact, the only reason that I was using my method was because I really 
> wanted to examine if the results were very susceptible to the splits of the 
> data into training and test. As k-fold only allows each sample to be in the 
> test set once, I thought this would allow for more flexibility. But I really 
> like the sound of the cross_val_predict function that you mentioned, so I 
> will install that version and give it a try.
>
> As for the problem… I still don’t know why that is occurring. My only guess 
> is that something is happening with the pool.map over the different CPUs on 
> the cluster. I don’t do any parallel programming myself though so I don’t 
> know how to delve into it. Could this be a reason for this to happen?
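
One way to check whether the thread pool plays a role is to compute the same splits serially and through the pool and compare them. The sketch below is only an illustration under stated assumptions: it uses a recent scikit-learn (``sklearn.model_selection``), a plain index array instead of the real data, made-up seeds, and a hypothetical helper ``split_test_indices``. Unlike the posted ``run_model``, which seeds the global NumPy RNG with ``npr.seed``, it hands the seed to ``train_test_split`` through ``random_state``, so no thread depends on state shared with another thread.

```python
from multiprocessing.dummy import Pool as ThreadPool

import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real data: sample indices only.
indices = np.arange(1000)
seeds = [3, 14, 15, 92, 65, 35, 89, 79]


def split_test_indices(seed):
    """Return the test-set indices for one seed.

    The seed is passed as random_state, so the split does not read the
    global NumPy RNG that all threads share.
    """
    _, test_idx = train_test_split(indices, train_size=0.75,
                                   random_state=int(seed))
    return tuple(test_idx)


# Same work, once in a plain loop and once mapped over a thread pool.
serial = [split_test_indices(s) for s in seeds]

pool = ThreadPool(4)
threaded = pool.map(split_test_indices, seeds)
pool.close()
pool.join()

print(serial == threaded)
```

If the serial and threaded lists ever differed for a given way of seeding, state shared between the threads would be the first thing to suspect.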
>
> Tim
>
> On 14 Jan 2015, at 02:28, Andy <t3k...@gmail.com> wrote:
>
>> Hi Tim.
>> So are you saying that even when fixing the random_state in 
>> train_test_split, it is not reproducible?
>> The archive code looks fine to me: 
>> http://sourceforge.net/p/scikit-learn/mailman/message/33220124/
>> Did you subscribe to the mailing list? Otherwise you will not get replies 
>> (except this one where I explicitly copied you into the "to").
>>
>> Andy
>>
>> On 01/13/2015 11:08 AM, Timothy Vivian-Griffiths wrote:
>>> Dear Joel,
>>>
>>> Thanks for the reply here. Apologies for the delay, but I have not been 
>>> receiving the email updates, and I only noticed your reply when I looked 
>>> on the archive.
>>>
>>> My problem has now arisen in a smaller dataset with only 125 features 
>>> (small in genetics) and 7633 samples. Again I isolated out the seeds which 
>>> caused the splits that resulted in high performance, and the good score was 
>>> not repeated when I ran the analysis multiple times again. Instead, I got a 
>>> repeated lower score for them as I would expect… but the first time around, 
>>> I got the higher score.
>>>
>>> I have taken your advice and set the random_state for train_test_split, but 
>>> this did not solve the problem. So I am still puzzled by this. What I have 
>>> done in the meantime is to use the cross_val_score with 50 folds instead of 
>>> my own permutation procedure to see if this makes any difference. But I am 
>>> still really curious as to why this has arisen in the first place. I'll 
>>> update on this, but it takes some time for each of these analyses to run on 
>>> the cluster.
>>>
>>> Just one other thing… I noticed that my code that I put into the email is 
>>> displayed well in the emails but does not display very well on the archive 
>>> page. Does anyone have any advice for the best way to put code into these 
>>> emails?
>>>
>>> Tim
>>>
>>>> Message: 3
>>>> Date: Sat, 10 Jan 2015 21:25:48 +1100
>>>> From: Joel Nothman <joel.noth...@gmail.com>
>>>> Subject: Re: [Scikit-learn-general] Thanks for help
>>>> To: scikit-learn-general <scikit-learn-general@lists.sourceforge.net>
>>>> Message-ID:
>>>>    <CAAkaFLVDuPvEBaVe+8Y=ucfosmvajbbfjn84wxhaz4n2dms...@mail.gmail.com>
>>>> Content-Type: text/plain; charset="utf-8"
>>>>
>>>> Hi Timothy,
>>>>
>>>> You are not setting random_state for train_test_split. Please check if this
>>>> fixes the problem.
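
For reference, a minimal sketch of what setting random_state on the split looks like. It assumes a recent scikit-learn (``sklearn.model_selection`` rather than the ``sklearn.cross_validation`` import used in the posted code) and toy arrays; the helper name ``split`` is made up. Note that the posted ``run_model`` assigns ``self.random_state = seed``, which is an attribute of the SVC estimator and is not passed on to ``train_test_split``.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)  # toy inputs: 20 samples x 2 features
y = np.arange(20) % 2             # toy binary labels


def split(seed):
    """Split the toy data 75/25, with the shuffle fixed by `seed`."""
    return train_test_split(X, y, train_size=0.75, random_state=seed)


first = split(1234)
second = split(1234)

# Each call returns [X_train, X_test, y_train, y_test]; compare piecewise.
same = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same)
```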
>>>>
>>>> - Joel
>>>>
>>>>
>>>> Ok, well once again, thank you for your reply. I will provide some of my 
>>>> code here and I hope that it helps. Just bear in mind that I have not had 
>>>> any formal training in programming, so my code will very much come under 
>>>> the category of 'PhDware'.
>>>>
>>>> So, this was the class that I wrote that gets imported into my different 
>>>> analysis scripts:
>>>>
>>>> # Filename: Clozuk_Machine_Learning_saving_random_state.py
>>>>
>>>> """
>>>> A module that is to be imported into other Python scripts in order to run
>>>> the different algorithms on the data.
>>>> """
>>>>
>>>> import numpy as np
>>>> import numpy.random as npr
>>>> from sklearn.svm import SVC
>>>> from sklearn.cross_validation import train_test_split
>>>> from sklearn.metrics import roc_auc_score
>>>>
>>>> class DimensionError(Exception):
>>>>    pass
>>>>
>>>> class clSVC(SVC):
>>>>    """
>>>>    clSVC(inputs, results, save_path, C=1, kernel='rbf', tp=0.5)
>>>>
>>>>    Sets up a Support Vector Classifier using the SVC object from the
>>>>    sklearn package.
>>>>
>>>>    For detailed list of parameters and attributes - see documentation on
>>>>    SVC from scikit-learn
>>>>
>>>>    Additional Parameters
>>>>    ---------------------
>>>>    inputs : numpy array
>>>>        A numpy array of all the input data to the model. Features
>>>>        represented in columns and samples represented in rows. Must be a
>>>>        2D array
>>>>    results : numpy array
>>>>        A one dimensional numpy array of the results of the data
>>>>    save_path : string
>>>>        The full path to where the answers for each permutation will be
>>>>        saved
>>>>    C : float
>>>>        A value representing how much importance is given to fitting the
>>>>        training data versus regularisation. Higher C values mean that the
>>>>        model is fitted to the data given, but could result in over-fitting
>>>>    tp : float > 0 and < 1
>>>>        A value representing the proportion of the data used in the training
>>>>        set with the remainder going to the test set
>>>>    """
>>>>
>>>>    def __init__(self, inputs, results, save_path, C=1, kernel='rbf',
>>>>                 tp=0.5):
>>>>        super(clSVC, self).__init__(C=C, kernel=kernel)
>>>>        if len(inputs.shape) == 2:
>>>>            self.inputs = inputs
>>>>        else:
>>>>            raise DimensionError('Inputs array must be 2D')
>>>>
>>>>        if len(results.shape) == 1:
>>>>            self.results = results
>>>>        else:
>>>>            raise DimensionError('Results array must be 1D')
>>>>
>>>>        self.save_path = save_path
>>>>
>>>>        # tp is a proportion, so it must lie strictly between 0 and 1
>>>>        if 0 < tp < 1:
>>>>            self.tp = tp
>>>>        else:
>>>>            raise ValueError('tp must be between 0 and 1. %0.2f entered'
>>>>                             % tp)
>>>>
>>>>
>>>>    def run_model(self, seed=None):
>>>>        if seed is None:
>>>>            seed = npr.randint(10000)
>>>>        npr.seed(seed)
>>>>
>>>>        # Using this seed to set the random state as well
>>>>        self.random_state = seed
>>>>
>>>>        # No random_state is passed here, so the split is drawn from the
>>>>        # global NumPy RNG seeded above
>>>>        X_train, X_test, y_train, y_test = train_test_split(
>>>>            self.inputs, self.results, train_size=self.tp)
>>>>        self.fit(X_train, y_train)
>>>>        predictions = self.predict(X_test)
>>>>        percentage_correct = (np.sum(predictions == y_test) /
>>>>                              float(len(predictions))) * 100
>>>>        score = roc_auc_score(y_test, predictions)
>>>>
>>>>        # Saving every iteration just in case the walltime runs out
>>>>        with open('%s.txt' % self.save_path, 'a') as fi:
>>>>            fi.write('%d %f %f\n' % (seed, percentage_correct, score))
>>>>        return (seed, percentage_correct, score)
>>>>
>>>>
>>>>
>>>>
>>>> And here is an example of a script that imports and runs this:
>>>>
>>>> # Filename: svm_125_GWAS_mp.py
>>>>
>>>> """
>>>> Reads in the genotypes and the phenotypes
>>>> and then carries out the whole SVC on them
>>>> """
>>>>
>>>> import numpy as np
>>>> from ClozukMachineLearning_saving_random_state import clSVC
>>>> from multiprocessing.dummy import Pool as ThreadPool
>>>> from sys import argv
>>>>
>>>> # Getting the parameter of C from the input and converting it to an integer
>>>> # Also getting kernel
>>>> script, C, kernel = argv
>>>> C = int(C)
>>>>
>>>> # Reading in the inputs and the targets
>>>> inputs = np.load(
>>>>     'stored_data_2015/125_GWAS/125_GWAS_combo_LOR_weighted_nan_removed_probabilistic_imputation.npy')
>>>> targets = np.load('stored_data_2015/125_GWAS/125_GWAS_combo_phenotype.npy')
>>>>
>>>> # Now building the actual model
>>>> save_path = ('svm_answers_2015/125_GWAS/500_permutations_125_GWAS_%s_svm_probabilistic_imputation_C_%d'
>>>>              % (kernel, C))
>>>> svc = clSVC(inputs, targets, save_path=save_path, C=C, kernel=kernel,
>>>>             tp=0.75)
>>>>
>>>> # Now reading in the random seeds
>>>> seeds = np.load('stored_data_2015/random_seeds.npy')
>>>>
>>>> # Now setting up the pool (multiprocessing.dummy.Pool is backed by threads)
>>>> pool = ThreadPool(16)
>>>>
>>>> # And mapping the learning function across these pools
>>>> answers = pool.map(svc.run_model, seeds)
>>>>
>>>> # Closing and joining the results
>>>> pool.close()
>>>> pool.join()
>>>>
>>>> Just to recap the problem that I am having for anyone new who might be 
>>>> reading this. The aim of this is to try and predict case/control status of 
>>>> a schizophrenia dataset given information on genotypes for the different 
>>>> subjects.  I am running this procedure 500 times for each dataset. Each 
>>>> time, the data is split into training and test (75%, 25% respectively) in 
>>>> different ways using the train_test_split function. Each split is 
>>>> determined by random seed integers that are read in from the file 
>>>> 'random_seeds.npy'. The intention was to train and test an SVC model each 
>>>> time, and then get a distribution of the scores. I wanted to see what the 
>>>> general performance was, as well as the effect of splitting the data into 
>>>> different training/test sets. For each permutation, the seed is used to 
>>>> determine the train_test_split and the random_state.
>>>>
>>>> I thought that by using the same seeds this would result in the same 
>>>> performance for the individual splits when carried out multiple times. 
>>>> However, while this is happening for a smaller dataset with only 125 
>>>> features, on a larger one with over 31,000 features, it is not. For the 
>>>> majority of the splits, I am getting ROC scores of around 0.6, which is 
>>>> what I would expect; but in about 15% of the splits, the scores are 
>>>> clustered much higher, around 0.9. When I first saw this, I wanted to find 
>>>> out what it was about those particular train/test splits that was causing 
>>>> the high scores, so I isolated the seeds that resulted in good 
>>>> performance, and ran the procedure on those alone. Instead of seeing a 
>>>> repetition of the high performance, I got a pattern similar to the one 
>>>> before: most scores were around 0.6, and about 15% were around 0.9. So 
>>>> the same splits of the data, with the random_state set the same for both 
>>>> runs, are giving different results.
>>>>
>>>> I decided to look at some of the seeds which resulted in high scores both 
>>>> times, by running this procedure on them multiple times. This time I got 
>>>> what I was expecting: the same answer (around 0.6) repeated for each run 
>>>> of each seed.
>>>>
>>>> So what is puzzling me is why any of these high scores appeared at all in 
>>>> the first place?
>>>>
>>>> Tim
>>>>
>>>> On 7 Jan 2015, at 16:18, Andy <t3k...@gmail.com> wrote:
>>>>
>>>> Hi Tim.
>>>> Please keep all discussions on the mailing list, as individual 
>>>> contributors might not find the time to respond.
>>>> With fixed random seeds, results should be reproducible. If you provide 
>>>> the full code, it might be possible to say where the problem lies.
>>>>
>>>> Sorry, I meant ShuffleSplit, but you can also use any other CV object.
>>>> The cross_val_predict function is not super essential, just a convenience. 
>>>> You should start with computing the scores using cross_val_score.
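
As a rough sketch of that suggestion: a shuffle-based CV object gives repeated random train/test splits, so a sample can land in several test sets, and ``cross_val_score`` returns one score per split. The example assumes a recent scikit-learn, where the class is ``ShuffleSplit`` in ``sklearn.model_selection``, and synthetic data from ``make_classification``. The number of splits and the 75/25 proportion are picked for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data; not the dataset discussed in this thread.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# 10 random 75/25 splits; the integer random_state fixes all of them.
cv = ShuffleSplit(n_splits=10, test_size=0.25, random_state=0)
scores = cross_val_score(SVC(C=1, kernel="rbf"), X, y, cv=cv, scoring="roc_auc")

# One ROC AUC per split, which gives the distribution over splits.
print(scores.mean(), scores.std())
```

Raising ``n_splits`` gives a distribution over as many splits as needed, without hand-written seeding code.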
>>>>
>>>> For clusters maintained by someone else: installing a version locally is 
>>>> super easy. Just check out the dev version and set the PYTHONPATH 
>>>> appropriately, or install into a virtual environment. Many people have 
>>>> custom python libraries (or whole environments) installed.
>>>>
>>>> Cheers,
>>>> Andy
>>>>
>>>>
>>>> On 01/07/2015 05:42 AM, Timothy Vivian-Griffiths wrote:
>>>> Dear Andy,
>>>>
>>>> Thank you for your reply on the scikit-learn problem I was having. Seeing 
>>>> as I am new to this, I am writing to you directly; is this what I should 
>>>> do, or should I reply to the response email that you gave me?
>>>>
>>>> As for the reproducibility, I have not set the probability to be True, so 
>>>> it should be running on the default. I am also setting the random state 
>>>> parameter, so I am puzzled as to what is happening. I haven't found a 
>>>> single split that is reproducing high performing results. I understand 
>>>> that there will be discrepancies in the data, but I don't understand why 
>>>> splits should perform differently on different occasions.
>>>>
>>>> Just another thing, I have noticed that the cross_val_predict function 
>>>> that you mention is in the latest version of sklearn, but I cannot find 
>>>> the RandomizedSplitCV one. Also, seeing as I am running my code on a 
>>>> cluster which I do not maintain, I think it's probably best if I wait 
>>>> until 0.16 becomes the stable version before I ask the admins to update. 
>>>> Do you have any idea of when this might be?
>>>>
>>>> Thanks for your help and apologies if I am not supposed to contact your 
>>>> email address directly,
>>>>
>>>> Tim
>>>>
>>>>
>>> ------------------------------------------------------------------------------
>>> New Year. New Location. New Benefits. New Data Center in Ashburn, VA.
>>> GigeNET is offering a free month of service with a new server in Ashburn.
>>> Choose from 2 high performing configs, both with 100TB of bandwidth.
>>> Higher redundancy.Lower latency.Increased capacity.Completely compliant.
>>> http://p.sf.net/sfu/gigenet
>>> _______________________________________________
>>> Scikit-learn-general mailing list
>>> Scikit-learn-general@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>>
>


