Dear Joel,

Thank you for your reply; I used decision_function and the results did 
replicate. However, I was wondering if someone could help me with this 
further. For this particular dataset and these parameters (C=1, 
kernel='rbf'), the classifier outputs 0 for every sample. It even does this 
when I run predict on the training data that was used to build the model, 
which I find particularly strange, and it happens for every train/test split 
I have tried. The data are very noisy, so I am not expecting a very accurate 
outcome, but the two (binary) classes are evenly balanced. With other 
parameters the classifier does a better job.
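
For reference, this is roughly the check I ran after your suggestion. It is
only a sketch: X and y stand for the inputs and targets arrays loaded in the
script further down, and the split is one of the manual ones I mentioned.

import numpy as np
from sklearn.svm import SVC
from sklearn.cross_validation import train_test_split
from sklearn.metrics import roc_auc_score

# One of the manual splits (X and y are the inputs/targets arrays)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

svc = SVC(C=1, kernel='rbf')
svc.fit(X_train, y_train)

# Hard predictions: with these parameters every sample comes out as 0
print(np.unique(svc.predict(X_test)))

# Continuous confidence scores: the ranking still carries information even
# though every score falls on the 0 side of the threshold, which is why
# roc_auc_score on decision_function reproduces the ~0.63 figure
print(roc_auc_score(y_test, svc.decision_function(X_test)))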

I have used this same dataset and parameters in R's implementation of an 
SVM, and it does not output all 0s, so I don't think the problem lies with 
the data itself.
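
One difference that might be worth ruling out (this is only a guess on my
part): I understand that some R SVM implementations, e.g. e1071's svm(),
standardise the features by default, whereas scikit-learn's SVC does not,
and the RBF kernel is sensitive to feature scaling. A rough sketch of the
comparison I have in mind, reusing the inputs and targets arrays from the
script below:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.cross_validation import cross_val_score

# Standardise each feature before fitting the SVM, then score as before
scaled_svc = Pipeline([('scale', StandardScaler()),
                       ('svc', SVC(C=1, kernel='rbf'))])

print(cross_val_score(scaled_svc, inputs, targets, cv=3, scoring='roc_auc'))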

I appreciate that in this situation I should just try different parameters, 
but I would like to get an idea of what is going on. I would also much 
prefer to work in Python, as I prefer the syntax and it is far quicker! My 
problem is that all of my colleagues work with R, so I need help when things 
go wrong in Python.

Tim


> 
> ROC AUC doesn't use binary predictions as its input; it uses the measure of
> confidence (or "decision function") that each sample should be assigned 1.
> cross_val_score is correctly using decision_function to get these
> continuous values, and you should find its results replicated by using
> roc_auc_score(y_test, svc.decision_function(X_test)) rather than the
> version with predict. Cheers, Joel
> 
> On 19 January 2015 at 21:45, Timothy Vivian-Griffiths <
> vivian-griffith...@cardiff.ac.uk> wrote:
> 
>> Just in case this does appear twice: I am sending it for a second time, as
>> I have not seen it appear in the website archives, nor has it featured in
>> the latest mail that I have received from this mailing list.
>> 
>> This is really following on from the recent problems that I have been
>> having trying to build an SVM classifier using genetic data to predict a
>> binary phenotype outcome. The controls were labelled as 0, and the cases as
>> 1, and the metric used to assess the performance was the roc_auc_score
>> function from sklearn.metrics. The main aim of this exercise was to assess
>> the difference in performance across the 'rbf' and 'linear' kernels and
>> two very different values of C: 1 and 1000.
>> 
>> Andy advised me to use the cross_val_predict function from
>> sklearn.cross_validation and compare the outcome with a permutation
>> procedure that I was carrying out using train_test_split. This did lead to
>> very different answers, and there was one model where the differences were
>> particularly noticeable: with C=1 and kernel='rbf'. When manually
>> splitting the data, this was leading to the SVC always predicting controls
>> (all 0s), and the roc score was therefore 0.5 for every different split.
>> However, cross_val_score (with scoring='roc_auc') gave different answers
>> (around 0.63, which is what I was expecting), so some of the predictions
>> must have been cases.
>> 
>> I read that cross_val_score uses StratifiedKFold to make the splits, so
>> I decided to test performance using cross_val_score and a manual procedure
>> using StratifiedKFold. Here is the code (for interest, the inputs matrix
>> had shape (7763, 125) and the target vector (125,)):
>> 
>> 
>> # Filename: compare_roc_scores.py
>> 
>> """
>> A quick script to compare the performance of the inbuilt cross_val_score
>> function and manually carrying out the k-fold cross validation using
>> StratifiedKFold
>> """
>> 
>> import numpy as np
>> from sklearn.svm import SVC
>> from sklearn.cross_validation import cross_val_score, StratifiedKFold
>> from sklearn.metrics import roc_auc_score
>> 
>> # Loading in the sample and feature matrix, and the target phenotypes
>> inputs  = np.load('stored_data_2015/125_GWAS/125_GWAS_combo_LOR_weighted_nan_removed_probabilistic_imputation.npy')
>> targets = np.load('stored_data_2015/125_GWAS/125_GWAS_combo_phenotype.npy')
>> 
>> # Setting up the classifier
>> svc     = SVC(C=1, kernel='rbf')
>> 
>> # First carrying out 'manual' 3-Fold stratified cross validation
>> stf     = StratifiedKFold(targets, 3)
>> 
>> manual_scores = []
>> 
>> for train, test in stf:
>>    X_train, X_test = inputs[train], inputs[test]
>>    y_train, y_test = targets[train], targets[test]
>>    svc.fit(X_train, y_train)
>>    predicted       = svc.predict(X_test)
>>    score           = roc_auc_score(y_test, predicted)
>>    manual_scores.append(score)
>> 
>> # Now carrying out the same procedure using 'cross_val_score'
>> new_svc = SVC(C=1, kernel='rbf')
>> scores  = cross_val_score(new_svc, inputs, targets, cv=3,
>> scoring='roc_auc')
>> 
>> print 'Manual Scores:',
>> for s in manual_scores:
>>    print s,
>> 
>> print
>> 
>> print 'Scores:',
>> for s in scores:
>>    print s,
>> 
>> 
>> And the output of this script is always:
>> 
>> Manual Scores: 0.5 0.5 0.5
>> Scores: 0.62113080937 0.625148733948 0.637621526518
>> 
>> I have no idea why these should be different. Can anyone help with this?
>> 
>> Tim
>> 
>> 


_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
