Dear Joel,

Thanks for the reply here. Apologies for the later delay, but I have not been 
receiving the email updates, and I only noticed your reply when I looked on the 
archive. 

My problem has now arisen in a smaller dataset with only 125 features (small in 
genetics) and 7633 samples. Again I isolated out the seeds which caused the 
splits that resulted in high performance, and the good score was not repeated 
when I ran the analysis multiple times again. Instead, I got a repeated lower 
score for them as I would expect… but the first time around, I got the higher 
score.

I have taken your advice and set the random_state for train_test_split, but 
this did not solve the problem. So I am still puzzled by this. What I have done 
in the meantime is to use the cross_val_score with 50 folds instead of my own 
permutation procedure to see if this makes any difference. But I am still 
really curious as to why this has arisen in the first place. I'll update on 
this, but it takes some time for each of these analyses to run on the cluster.

Just one other thing… I noticed that my code that I put into the email is 
displayed well in the emails but does not display very well on the archive 
page. Does anyone have any advice for the best way to put code into these 
emails?

Tim

> 
> 
> Message: 3
> Date: Sat, 10 Jan 2015 21:25:48 +1100
> From: Joel Nothman <joel.noth...@gmail.com>
> Subject: Re: [Scikit-learn-general] Thanks for help
> To: scikit-learn-general <scikit-learn-general@lists.sourceforge.net>
> Message-ID:
>       <CAAkaFLVDuPvEBaVe+8Y=ucfosmvajbbfjn84wxhaz4n2dms...@mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
> 
> Hi Timothy,
> 
> You are not setting random_state for train_test_split. Please check if this
> fixes the problem.
> 
> - Joel
> 
> 
> Ok, well once again, thank you for your reply. I will provide some of my code 
> here and I hope that it helps. Just bear in mind that I have not had any 
> formal training in programming, so my code will very much come under the 
> category of 'PhDware'.
> 
> So, this was the class that I wrote that gets imported into my different 
> analysis scripts:
> 
> # Filename: Clozuk_Machine_Learning_saving_random_state.py
> 
> """
> A module that is to be imported into other Python scripts in order to run
> the different algorithms on the data.
> """
> 
> import numpy as np
> import numpy.random as npr
> from sklearn.svm import SVC
> from sklearn.cross_validation import train_test_split
> from sklearn.metrics import roc_auc_score
> 
> class DimensionError(Exception):
>    pass
> 
> class clSVC(SVC):
>    """
>    ccSVC(inputs, results, save_path, C=1, kernel='rbf', tp=0.5)
> 
>    Sets up a Support Vector Classifier using the SVC object from the sklearn
>    package.
> 
>    For detailed list of parameters and attributes - see documentation on SVC
>    from scikit-learn
> 
>    Additional Parameters
>    ---------------------
>    inputs : numpy array
>        A numpy array of all the input data to the model. Features represented
>        in columns and samples represented in rows. Must be a 2D array
>    results : numpy array
>        A one dimensional numpy array of the results of the data
>    save_path : string
>         The full path to where the answers for each permutation will be saved
>    C : float
>        A value representing how much importance is given to fitting the
>        training data versus regularisation. Higher C values mean that the 
> model
>        is fitted to the data given, but could result in over-fitting
>    tp : float > 0 and <1
>        A value representing the proportion of the data used in the training
>        set with the remainder going to the test set
>    """
> 
>    def __init__(self, inputs, results, save_path, C=1, kernel='rbf', tp=0.5):
>        super(clSVC, self).__init__(C=C, kernel=kernel)
>        if len(inputs.shape)==2:
>            self.inputs = inputs
>        else:
>            raise DimensionError('Inputs array must be 2D')
> 
>        if len(results.shape)==1:
>            self.results = results
>        else:
>            raise DimensionError('Results array must be 1D')
> 
>        self.save_path = save_path
> 
> 
>        if tp >0 or tp < 1:
>            self.tp = tp
>        else:
>            raise ValueError('tp must be between 0 and 1. %0.2f entered' % tp)
> 
> 
>    def run_model(self, seed=None):
>        if seed==None:
>            seed = npr.randint(10000)
>        npr.seed(seed)
> 
>        # Using this seed to set the random state as well
>        self.random_state = seed
> 
>        X_train, X_test, y_train, y_test = train_test_split(self.inputs, \
>                                            self.results, train_size=self.tp)
>        self.fit(X_train, y_train)
>        predictions = self.predict(X_test)
>        percentage_correct = (np.sum(predictions == y_test) / 
> float(len(predictions))) * 100
>        score = roc_auc_score(y_test, predictions)
> 
>        # Saving every iteration just in case the walltime runs out
>        fi = open('%s.txt' % self.save_path, 'a')
>        fi.write('%d %f %f\n' % (seed, percentage_correct, score))
>        fi.close()
>        return (seed, percentage_correct, score)
> 
> 
> 
> 
> And here is an example of a script that imports and runs this:
> 
> # Filename: svm_125_GWAS_mp.py
> 
> """
> Reads in the genotypes and the phenotypes
> and then carries out the whole SVC on them
> """
> 
> import numpy as np
> from ClozukMachineLearning_saving_random_state import clSVC
> from multiprocessing.dummy import Pool as ThreadPool
> from sys import argv
> 
> # Getting the parameter of C from the input and converting it to an integer
> # Also getting kernel
> script, C, kernel = argv
> C = int(C)
> 
> # Reading in the inputs and the targets
> inputs = 
> np.load('stored_data_2015/125_GWAS/125_GWAS_combo_LOR_weighted_nan_removed_probabilistic_imputation.npy')
> targets = np.load('stored_data_2015/125_GWAS/125_GWAS_combo_phenotype.npy')
> 
> # Now building the actual model
> svc = clSVC(inputs, targets, C=C, 
> save_path='svm_answers_2015/125_GWAS/500_permutations_125_GWAS_%s_svm_probabilistic_imputation_C_%d'
>  % (kernel, C), kernel=kernel, tp=0.75)
> 
> # Now reading in the random seeds
> seeds = np.load('stored_data_2015/random_seeds.npy')
> 
> # Now setting up the multiprocessing pools
> pool = ThreadPool(16)
> 
> # And mapping the learning function across these pools
> answers = pool.map(svc.run_model, seeds)
> 
> # Closing and joining the results
> pool.close()
> pool.join()
> 
> Just to recap the problem that I am having for anyone new who might be 
> reading this. The aim of this is to try and predict case/control status of a 
> schizophrenia dataset given information on genotypes for the different 
> subjects.  I am running this procedure 500 times for each dataset. Each time, 
> the data is split into training and test (75%, 25% respectively) in different 
> ways using the train_test_split function. Each split is determined by random 
> seed integers that are read in from the file 'random_seeds.npy'. The 
> intention was to train and test an SVC model each time, and then get a 
> distribution of the scores. I wanted to see what the general performance was, 
> as well as the effect of splitting the data into different training/test 
> sets. For each permutation, the seed is used to determine the 
> train_test_split and the random_state.
> 
> I thought that by using the same seeds this would result in the same 
> performance for the individual splits when carried out multiple times. 
> However, while this is happening for a smaller dataset with only 125 
> features, on a larger one with over 31,000 features, it is not. For the 
> majority of the splits, I am getting ROC scores of around 0.6, which is what 
> I would expect; but in about 15% of the splits, the scores are grouped very 
> highly, around 0.9. When I first saw this, I wanted to find out what was 
> different about those particular train/test splits was causing the high 
> scores, so I isolated the seeds that resulted in good performance, and ran 
> the procedure on those alone. Instead of seeing a repetition of the high 
> performance, I got a similar pattern that was happening before: most scores 
> were around 0.6, and about 15% were around 0.9. So the performance for the 
> same splits of the data, with the random_state being set the same for both 
> runs, are giving different results.
> 
> I decided to look at some of the seeds which resulted in high scores both 
> times, by running this procedure on them multiple times. Then I am getting 
> what I was expecting: the same answer (around 0.6) repeated for each run of 
> each seed.
> 
> So what is puzzling me is why any of these high scores appeared at all in the 
> first place?
> 
> Tim
> 
> On 7 Jan 2015, at 16:18, Andy <t3k...@gmail.com<mailto:t3k...@gmail.com>> 
> wrote:
> 
> Hi Tim.
> Please keep all discussions on the mailing list, as individual contributors 
> might not find the time to respond.
> With fixed random seeds, results should be reproducible. If you provide the 
> full code, it might be possible to say where the problem lies.
> 
> Sorry, I meant ShuffleSplitCV, but you can also use any other CV object.
> The cross_val_predict function is not super essential, just a convenience. 
> You should start with computing the scores using cross_val_score.
> 
> For otherwise maintained clusters: installing a version locally is super 
> easy. Just check out the dev version and set the pythonpath appropriately,
> or install into a virtual environment. Many people have custom python 
> libraries (or whole environments) installed.
> 
> Cheers,
> Andy
> 
> 
> On 01/07/2015 05:42 AM, Timothy Vivian-Griffiths wrote:
> Dear Andy,
> 
> Thank you for your reply on the scikit-learn problem I was having. Seeing as 
> I am new to this, I am writing to you directly; is this what I should do, or 
> should I reply to the response email that you gave me?
> 
> As for the reproducability, I have not set the probability to be True, so it 
> should be running on the default. I am also setting the random state 
> parameter, so I am puzzled as to what is happening. I haven't found a single 
> split that is reproducing high performing results. I understand that there 
> will be discrepancies in the data, but I don't understand why splits should 
> perform differently on different occasions.
> 
> Just another thing, I have noticed that the cross_val_predict function that 
> you mention is in the latest version of sklearn, but I cannot find the 
> RandomizedSplitCV one. Also, seeing as I am running my code on a cluster 
> which I do not maintain, I think it's probably best if I wait until 0.16 
> becomes the stable version before I ask the admins to update. Do you have any 
> idea of when this might be?
> 
> Thanks for your help and apologies if I am not supposed to contact your email 
> address directly,
> 
> Tim
> 
> 


------------------------------------------------------------------------------
New Year. New Location. New Benefits. New Data Center in Ashburn, VA.
GigeNET is offering a free month of service with a new server in Ashburn.
Choose from 2 high performing configs, both with 100TB of bandwidth.
Higher redundancy.Lower latency.Increased capacity.Completely compliant.
http://p.sf.net/sfu/gigenet
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

Reply via email to