Hi Timothy,

You are not setting random_state for train_test_split. Please check if this
fixes the problem.
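
Something along these lines inside run_model, for example (an untested sketch; it just passes the seed you already have through to the train_test_split call in your module):

# Sketch only: pass the per-permutation seed through to train_test_split
# so that the same seed always yields the same train/test partition.
X_train, X_test, y_train, y_test = train_test_split(
    self.inputs, self.results, train_size=self.tp, random_state=seed)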

- Joel

On 10 January 2015 at 01:57, Timothy Vivian-Griffiths <
vivian-griffith...@cardiff.ac.uk> wrote:

>  Ok, well once again, thank you for your reply. I will provide some of my
> code here and I hope that it helps. Just bear in mind that I have not had
> any formal training in programming, so my code will very much come under
> the category of 'PhDware'.
>
>  So, this was the class that I wrote that gets imported into my different
> analysis scripts:
>
> # Filename: Clozuk_Machine_Learning_saving_random_state.py
> """
> A module that is to be imported into other Python scripts in order to run
> the different algorithms on the data.
> """
> import numpy as np
> import numpy.random as npr
> from sklearn.svm import SVC
> from sklearn.cross_validation import train_test_split
> from sklearn.metrics import roc_auc_score
>
>
> class DimensionError(Exception):
>     pass
>
>
> class clSVC(SVC):
>     """
>     clSVC(inputs, results, save_path, C=1, kernel='rbf', tp=0.5)
>
>     Sets up a Support Vector Classifier using the SVC object from the
>     sklearn package.
>
>     For a detailed list of parameters and attributes, see the documentation
>     on SVC from scikit-learn.
>
>     Additional Parameters
>     ---------------------
>     inputs : numpy array
>         A numpy array of all the input data to the model. Features
>         represented in columns and samples represented in rows. Must be a
>         2D array
>     results : numpy array
>         A one-dimensional numpy array of the results of the data
>     save_path : string
>         The full path to where the answers for each permutation will be
>         saved
>     C : float
>         A value representing how much importance is given to fitting the
>         training data versus regularisation. Higher C values mean that the
>         model is fitted to the data given, but could result in over-fitting
>     tp : float > 0 and < 1
>         A value representing the proportion of the data used in the
>         training set, with the remainder going to the test set
>     """
>     def __init__(self, inputs, results, save_path, C=1, kernel='rbf',
>                  tp=0.5):
>         super(clSVC, self).__init__(C=C, kernel=kernel)
>         if len(inputs.shape) == 2:
>             self.inputs = inputs
>         else:
>             raise DimensionError('Inputs array must be 2D')
>         if len(results.shape) == 1:
>             self.results = results
>         else:
>             raise DimensionError('Results array must be 1D')
>         self.save_path = save_path
>         if tp > 0 and tp < 1:
>             self.tp = tp
>         else:
>             raise ValueError('tp must be between 0 and 1. %0.2f entered' % tp)
>
>     def run_model(self, seed=None):
>         if seed is None:
>             seed = npr.randint(10000)
>         npr.seed(seed)
>
>         # Using this seed to set the random state as well
>         self.random_state = seed
>
>         X_train, X_test, y_train, y_test = train_test_split(
>             self.inputs, self.results, train_size=self.tp)
>         self.fit(X_train, y_train)
>         predictions = self.predict(X_test)
>         percentage_correct = (np.sum(predictions == y_test) /
>                               float(len(predictions))) * 100
>         score = roc_auc_score(y_test, predictions)
>
>         # Saving every iteration just in case the walltime runs out
>         fi = open('%s.txt' % self.save_path, 'a')
>         fi.write('%d %f %f\n' % (seed, percentage_correct, score))
>         fi.close()
>         return (seed, percentage_correct, score)
>
>  And here is an example of a script that imports and runs this:
>
> # Filename: svm_125_GWAS_mp.py
> """
> Reads in the genotypes and the phenotypes and then carries out the whole
> SVC on them
> """
> import numpy as np
> from ClozukMachineLearning_saving_random_state import clSVC
> from multiprocessing.dummy import Pool as ThreadPool
> from sys import argv
>
> # Getting the parameter of C from the input and converting it to an integer
> # Also getting the kernel
> script, C, kernel = argv
> C = int(C)
>
> # Reading in the inputs and the targets
> inputs = np.load('stored_data_2015/125_GWAS/125_GWAS_combo_LOR_weighted_nan_removed_probabilistic_imputation.npy')
> targets = np.load('stored_data_2015/125_GWAS/125_GWAS_combo_phenotype.npy')
>
> # Now building the actual model
> svc = clSVC(inputs, targets, C=C,
>             save_path='svm_answers_2015/125_GWAS/500_permutations_125_GWAS_%s_svm_probabilistic_imputation_C_%d' % (kernel, C),
>             kernel=kernel, tp=0.75)
>
> # Now reading in the random seeds
> seeds = np.load('stored_data_2015/random_seeds.npy')
>
> # Now setting up the multiprocessing pools
> pool = ThreadPool(16)
>
> # And mapping the learning function across these pools
> answers = pool.map(svc.run_model, seeds)
>
> # Closing and joining the results
> pool.close()
> pool.join()
>
>  Just to recap the problem I am having, for anyone new who might be
> reading this: the aim is to predict case/control status in a schizophrenia
> dataset, given information on the genotypes of the different subjects. I
> am running this procedure 500 times for each dataset. Each time, the data
> is split into training and test sets (75% and 25% respectively) in
> different ways using the train_test_split function. Each split is
> determined by a random seed integer read in from the file
> 'random_seeds.npy'. The intention was to train and test an SVC model each
> time and then get a distribution of the scores, to see both the general
> performance and the effect of splitting the data into different
> training/test sets. For each permutation, the seed is used to determine
> the train_test_split and the random_state.
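>
> In outline, each permutation is meant to do something like this (a
> simplified illustration of run_model above, not new code; seed, inputs,
> targets, C and kernel are as in the scripts):
>
> # One permutation in outline: the integer seed is meant to fix both the
> # 75/25 partition (via numpy's global RNG here) and the SVC's random_state.
> npr.seed(seed)
> X_train, X_test, y_train, y_test = train_test_split(inputs, targets,
>                                                     train_size=0.75)
> model = SVC(C=C, kernel=kernel, random_state=seed)
> model.fit(X_train, y_train)
> score = roc_auc_score(y_test, model.predict(X_test))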
>
>  I thought that using the same seeds would give the same performance for
> the individual splits when the procedure was carried out multiple times.
> This is indeed what happens for a smaller dataset with only 125 features,
> but on a larger one with over 31,000 features it is not. For the majority
> of the splits, I am getting ROC scores of around 0.6, which is what I
> would expect; but in about 15% of the splits, the scores are grouped very
> highly, around 0.9. When I first saw this, I wanted to find out what was
> different about those particular train/test splits that was causing the
> high scores, so I isolated the seeds that resulted in good performance and
> ran the procedure on those alone. Instead of seeing a repetition of the
> high performance, I got a similar pattern to before: most scores were
> around 0.6, and about 15% were around 0.9. So the same splits of the data,
> with the random_state set the same for both runs, are giving different
> results.
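>
> One way I can check whether a given seed really reproduces the same
> partition is to build the split twice and compare, e.g. (a rough sketch
> using the same mechanism as run_model; 1234 is just an arbitrary example
> seed, and inputs/targets are the arrays loaded in the script above):
>
> import numpy as np
> import numpy.random as npr
> from sklearn.cross_validation import train_test_split
>
> def split_labels(seed, inputs, targets, tp=0.75):
>     # Mirror run_model: seed the global RNG, then split without an
>     # explicit random_state.
>     npr.seed(seed)
>     _, _, _, y_test = train_test_split(inputs, targets, train_size=tp)
>     return y_test
>
> print(np.array_equal(split_labels(1234, inputs, targets),
>                      split_labels(1234, inputs, targets)))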
>
>  I decided to look at some of the seeds which resulted in high scores
> both times, by running the procedure on them several more times. This
> time I got what I was expecting: the same answer (around 0.6) repeated
> for each run of each seed.
>
>  So what is puzzling me is why any of these high scores appeared at all
> in the first place.
>
>  Tim
>
>  On 7 Jan 2015, at 16:18, Andy <t3k...@gmail.com> wrote:
>
> Hi Tim.
> Please keep all discussions on the mailing list, as individual
> contributors might not find the time to respond.
> With fixed random seeds, results should be reproducible. If you provide
> the full code, it might be possible to say where the problem lies.
>
> Sorry, I meant ShuffleSplitCV, but you can also use any other CV object.
> The cross_val_predict function is not super essential, just a convenience.
> You should start with computing the scores using cross_val_score.
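>
> For example, something like this (a rough sketch against the 0.15-era
> cross_validation module, assuming ShuffleSplit is what you want and that
> inputs/targets are the arrays from your script; random_state=0 is just an
> arbitrary choice):
>
> from sklearn.svm import SVC
> from sklearn.cross_validation import ShuffleSplit, cross_val_score
>
> # 500 random 75/25 splits, with the AUC computed on each test fold.
> cv = ShuffleSplit(len(targets), n_iter=500, test_size=0.25, random_state=0)
> scores = cross_val_score(SVC(C=1, kernel='rbf'), inputs, targets,
>                          scoring='roc_auc', cv=cv, n_jobs=16)
>
> Note that the 'roc_auc' scorer uses the SVC decision function rather than
> hard predictions, so the numbers will not match roc_auc_score(y_test,
> predictions) exactly, but the distribution over splits gives a similar
> picture.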
>
> For clusters maintained by someone else: installing a version locally is
> super easy. Just check out the dev version and set the PYTHONPATH
> appropriately, or install into a virtual environment. Many people have
> custom Python libraries (or whole environments) installed.
>
> Cheers,
> Andy
>
>
> On 01/07/2015 05:42 AM, Timothy Vivian-Griffiths wrote:
>
> Dear Andy,
>
> Thank you for your reply on the scikit-learn problem I was having. Seeing
> as I am new to this, I am writing to you directly; is this what I should
> do, or should I reply to the response email that you gave me?
>
> As for the reproducibility: I have not set probability to True, so it
> should be running on the default. I am also setting the random_state
> parameter, so I am puzzled as to what is happening. I haven't found a
> single split that reproduces the high-performing results. I understand
> that there will be discrepancies in the data, but I don't understand why
> the same splits should perform differently on different occasions.
>
> Just another thing, I have noticed that the cross_val_predict function
> that you mention is in the latest version of sklearn, but I cannot find the
> RandomizedSplitCV one. Also, seeing as I am running my code on a cluster
> which I do not maintain, I think it's probably best if I wait until 0.16
> becomes the stable version before I ask the admins to update. Do you have
> any idea of when this might be?
>
> Thanks for your help and apologies if I am not supposed to contact your
> email address directly,
>
> Tim
>
