Ok, well once again, thank you for your reply. I will provide some of my code
here and I hope that it helps. Just bear in mind that I have not had any formal
training in programming, so my code will very much come under the category of
'PhDware'.
So, this was the class that I wrote that gets imported into my different
analysis scripts:
# Filename: Clozuk_Machine_Learning_saving_random_state.py
"""
A module that is to be imported into other Python scripts in order to run
the different algorithms on the data.
"""
import numpy as np
import numpy.random as npr
from sklearn.svm import SVC
from sklearn.cross_validation import train_test_split
from sklearn.metrics import roc_auc_score
class DimensionError(Exception):
    pass
class clSVC(SVC):
    """
    clSVC(inputs, results, save_path, C=1, kernel='rbf', tp=0.5)

    Sets up a Support Vector Classifier using the SVC object from the
    sklearn package.

    For a detailed list of parameters and attributes, see the scikit-learn
    documentation on SVC.

    Additional Parameters
    ---------------------
    inputs : numpy array
        A 2D numpy array of all the input data to the model, with features
        represented in columns and samples represented in rows.
    results : numpy array
        A one-dimensional numpy array of the target values of the data.
    save_path : string
        The full path to where the answers for each permutation will be
        saved.
    C : float
        A value representing how much importance is given to fitting the
        training data versus regularisation. Higher C values mean the model
        is fitted more closely to the data given, but could result in
        over-fitting.
    tp : float, 0 < tp < 1
        A value representing the proportion of the data used in the
        training set, with the remainder going to the test set.
    """
    def __init__(self, inputs, results, save_path, C=1, kernel='rbf', tp=0.5):
        super(clSVC, self).__init__(C=C, kernel=kernel)
        if len(inputs.shape) == 2:
            self.inputs = inputs
        else:
            raise DimensionError('Inputs array must be 2D')
        if len(results.shape) == 1:
            self.results = results
        else:
            raise DimensionError('Results array must be 1D')
        self.save_path = save_path
        if 0 < tp < 1:
            self.tp = tp
        else:
            raise ValueError('tp must be between 0 and 1. %0.2f entered' % tp)
    def run_model(self, seed=None):
        if seed is None:
            seed = npr.randint(10000)
        # Seeding numpy's global random state; note that this state is
        # shared process-wide, so concurrent threads all draw from it
        npr.seed(seed)
        # Using this seed to set the estimator's random_state as well
        self.random_state = seed
        X_train, X_test, y_train, y_test = train_test_split(
            self.inputs, self.results, train_size=self.tp)
        self.fit(X_train, y_train)
        predictions = self.predict(X_test)
        percentage_correct = (np.sum(predictions == y_test) /
                              float(len(predictions))) * 100
        score = roc_auc_score(y_test, predictions)
        # Saving every iteration just in case the walltime runs out
        with open('%s.txt' % self.save_path, 'a') as fi:
            fi.write('%d %f %f\n' % (seed, percentage_correct, score))
        return (seed, percentage_correct, score)
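For reference, here is a sketch (not the code I actually ran) of a run_model-style function in which the seed is handed straight to train_test_split through its random_state argument, so nothing depends on numpy's global state:

import numpy as np
from sklearn.cross_validation import train_test_split
from sklearn.metrics import roc_auc_score

def run_model_explicit(estimator, inputs, results, tp, seed):
    # Hypothetical variant: the seed goes directly into train_test_split,
    # so the split does not depend on numpy's process-wide random state
    estimator.random_state = seed
    X_train, X_test, y_train, y_test = train_test_split(
        inputs, results, train_size=tp, random_state=seed)
    estimator.fit(X_train, y_train)
    predictions = estimator.predict(X_test)
    percentage_correct = 100.0 * np.mean(predictions == y_test)
    score = roc_auc_score(y_test, predictions)
    return (seed, percentage_correct, score)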
And here is an example of a script that imports and runs this:
# Filename: svm_125_GWAS_mp.py
"""
Reads in the genotypes and the phenotypes
and then carries out the whole SVC on them
"""
import numpy as np
from Clozuk_Machine_Learning_saving_random_state import clSVC
from multiprocessing.dummy import Pool as ThreadPool
from sys import argv
# Getting the C parameter and the kernel from the command-line arguments;
# C arrives as a string, so it is converted to an integer
script, C, kernel = argv
C = int(C)
# Reading in the inputs and the targets
inputs = np.load('stored_data_2015/125_GWAS/125_GWAS_combo_LOR_weighted_nan_removed_probabilistic_imputation.npy')
targets = np.load('stored_data_2015/125_GWAS/125_GWAS_combo_phenotype.npy')
# Now building the actual model
svc = clSVC(inputs, targets, C=C,
            save_path='svm_answers_2015/125_GWAS/500_permutations_125_GWAS_%s_svm_probabilistic_imputation_C_%d' % (kernel, C),
            kernel=kernel, tp=0.75)
# Now reading in the random seeds
seeds = np.load('stored_data_2015/random_seeds.npy')
# Now setting up the thread pool (multiprocessing.dummy gives threads
# rather than separate processes)
pool = ThreadPool(16)
# And mapping the learning function across the pool
answers = pool.map(svc.run_model, seeds)
# Closing the pool and waiting for all the workers to finish
pool.close()
pool.join()
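For debugging, the serial equivalent of the pool section (the same seeds, one at a time, with no threads) would just be:

# Serial run over the same seeds, for comparison with the threaded version
answers = [svc.run_model(seed) for seed in seeds]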
Just to recap the problem for anyone new who might be reading this: the aim
is to predict the case/control status of a schizophrenia dataset given
information on the genotypes of the different subjects. I am running this
procedure 500 times for each dataset. Each time, the data is split into
training and test sets (75% and 25% respectively) in a different way using
the train_test_split function. Each split is determined by a random seed
integer read in from the file 'random_seeds.npy'. The intention was to train
and test an SVC model on each split and then look at the distribution of the
scores, both to see what the general performance was and to gauge the effect
of splitting the data into different training/test sets. For each
permutation, the seed is used to determine both the train_test_split and the
random_state.
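As a sanity check on that assumption, a serial toy example like the following (with made-up data standing in for the real arrays) does hand back the identical split when the global seed is reset:

import numpy as np
import numpy.random as npr
from sklearn.cross_validation import train_test_split

# Toy stand-ins for the real genotype/phenotype arrays
X = npr.RandomState(0).rand(40, 5)
y = npr.RandomState(1).randint(0, 2, 40)

npr.seed(1234)
first = train_test_split(X, y, train_size=0.75)
npr.seed(1234)
second = train_test_split(X, y, train_size=0.75)

# Run serially, re-seeding reproduces the identical split
assert all(np.array_equal(a, b) for a, b in zip(first, second))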
I thought that using the same seeds would result in the same performance for
the individual splits when the procedure was carried out multiple times.
While this is happening for a smaller dataset with only 125 features, on a
larger one with over 31,000 features it is not. For the majority of the
splits, I am getting ROC scores of around 0.6, which is what I would expect;
but in about 15% of the splits, the scores are grouped very highly, around
0.9. When I first saw this, I wanted to find out what it was about those
particular train/test splits that was causing the high scores, so I isolated
the seeds that had resulted in good performance and ran the procedure on
those alone. Instead of seeing a repetition of the high performance, I got a
pattern similar to the one from before: most scores were around 0.6, and
about 15% were around 0.9. So the same splits of the data, with the
random_state set identically for both runs, are giving different results.
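A filter over the saved text file is enough for that isolation step, something along these lines (placeholder filenames; the columns are seed, percentage correct and ROC AUC, as written by run_model):

import numpy as np

# 'results.txt' stands in for the real save_path file
res = np.loadtxt('results.txt')
# Keep the seeds whose ROC AUC (third column) falls in the high cluster
high_seeds = res[res[:, 2] > 0.8, 0].astype(int)
np.save('high_scoring_seeds.npy', high_seeds)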
I decided to look at some of the seeds that resulted in high scores both
times by running the procedure on them multiple times. This time I got what
I was expecting: the same answer (around 0.6) repeated for each run of each
seed.
So what is puzzling me is why any of these high scores appeared at all in
the first place.
Tim
On 7 Jan 2015, at 16:18, Andy <t3k...@gmail.com> wrote:
Hi Tim.
Please keep all discussions on the mailing list, as individual contributors
might not find the time to respond.
With fixed random seeds, results should be reproducible. If you provide the
full code, it might be possible to say where the problem lies.
Sorry, I meant ShuffleSplitCV, but you can also use any other CV object.
The cross_val_predict function is not super essential, just a convenience. You
should start with computing the scores using cross_val_score.
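For example, something like this (a minimal sketch on toy data, using the 0.15-era cross_validation API):

import numpy as np
from sklearn.svm import SVC
from sklearn.cross_validation import ShuffleSplit, cross_val_score

# Toy data in place of the real arrays
X = np.random.RandomState(0).rand(100, 10)
y = np.random.RandomState(1).randint(0, 2, 100)

# One CV object owns all the splits, pinned down by a single random_state
cv = ShuffleSplit(len(y), n_iter=500, test_size=0.25, random_state=42)
scores = cross_val_score(SVC(C=1, kernel='rbf'), X, y,
                         scoring='roc_auc', cv=cv)
print(scores.mean())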
For clusters maintained by someone else: installing a version locally is super easy.
Just check out the dev version and set the pythonpath appropriately,
or install into a virtual environment. Many people have custom python libraries
(or whole environments) installed.
Cheers,
Andy
On 01/07/2015 05:42 AM, Timothy Vivian-Griffiths wrote:
Dear Andy,
Thank you for your reply on the scikit-learn problem I was having. Seeing as I
am new to this, I am writing to you directly; is this what I should do, or
should I reply to the response email that you gave me?
As for the reproducibility, I have not set probability=True, so it should be
running with the default. I am also setting the random_state parameter, so I
am puzzled as to what is happening. I haven't found a single split that
reproduces the high-performing results. I understand that there will be
discrepancies in the data, but I don't understand why splits should perform
differently on different occasions.
Just another thing, I have noticed that the cross_val_predict function that you
mention is in the latest version of sklearn, but I cannot find the
RandomizedSplitCV one. Also, seeing as I am running my code on a cluster which
I do not maintain, I think it's probably best if I wait until 0.16 becomes the
stable version before I ask the admins to update. Do you have any idea of when
this might be?
Thanks for your help and apologies if I am not supposed to contact your email
address directly,
Tim