Apologies if this does appear twice: I am sending it a second time, as I have not seen it in the website archives, nor has it featured in the latest mail I received from this list.
This follows on from the recent problems I have been having building an SVM classifier that uses genetic data to predict a binary phenotype. The controls were labelled 0 and the cases 1, and performance was assessed with the roc_auc_score function from sklearn.metrics. The main aim of the exercise was to compare performance across the 'rbf' and 'linear' kernels and two very different values of C: 1 and 1000.
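For concreteness, the four configurations being compared look like this (just a sketch, with my own loop shorthand):

from sklearn.svm import SVC

# The 2 kernels x 2 values of C under comparison
for kernel in ('rbf', 'linear'):
    for C in (1, 1000):
        svc = SVC(C=C, kernel=kernel)
        # ... fit and score svc as in the script below ...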
Andy advised me to use the cross_val_predict function from sklearn.cross_validation and to compare the outcome with a permutation procedure I was carrying out using train_test_split. This did lead to very different answers, and there was one model where the difference was particularly noticeable: C=1 with kernel='rbf'. When splitting the data manually, the SVC always predicted controls (all 0s), so the ROC score was 0.5 for every split. However, cross_val_score (with scoring='roc_auc') gave different answers (around 0.63, which is what I was expecting)… so some of the predictions must have been cases.
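To illustrate the 0.5 scores: when every prediction is the same class there is nothing for roc_auc_score to rank, so it returns 0.5 by construction. A tiny example with made-up labels:

import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 1, 1])   # made-up labels
y_pred = np.zeros_like(y_true)       # a classifier that always predicts controls
print roc_auc_score(y_true, y_pred)  # 0.5 - constant predictions carry no ranking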
I read that cross_val_score uses StratifiedKFold to make the splits, so I decided to compare cross_val_score against a manual procedure built on StratifiedKFold. Here is the code; for interest, the inputs matrix had shape (7763, 125), i.e. 7763 samples by 125 features, and the target vector (7763,):
# Filename: compare_roc_scores.py
"""
A quick script to compare the performance of the inbuilt cross_val_score
function and manually carrying out the k-fold cross-validation using
StratifiedKFold.
"""
import numpy as np

from sklearn.svm import SVC
from sklearn.cross_validation import cross_val_score, StratifiedKFold
from sklearn.metrics import roc_auc_score

# Load the sample/feature matrix and the target phenotypes
inputs = np.load('stored_data_2015/125_GWAS/125_GWAS_combo_LOR_weighted_nan_removed_probabilistic_imputation.npy')
targets = np.load('stored_data_2015/125_GWAS/125_GWAS_combo_phenotype.npy')

# Set up the classifier
svc = SVC(C=1, kernel='rbf')

# First carry out 'manual' 3-fold stratified cross-validation
stf = StratifiedKFold(targets, 3)
manual_scores = []
for train, test in stf:
    X_train, X_test = inputs[train], inputs[test]
    y_train, y_test = targets[train], targets[test]
    svc.fit(X_train, y_train)
    # Score the hard 0/1 class predictions on the held-out fold
    predicted = svc.predict(X_test)
    score = roc_auc_score(y_test, predicted)
    manual_scores.append(score)

# Now carry out the same procedure using cross_val_score
new_svc = SVC(C=1, kernel='rbf')
scores = cross_val_score(new_svc, inputs, targets, cv=3, scoring='roc_auc')

print 'Manual Scores:', ' '.join(str(s) for s in manual_scores)
print 'Scores:', ' '.join(str(s) for s in scores)
And the output of this script is always:
Manual Scores: 0.5 0.5 0.5
Scores: 0.62113080937 0.625148733948 0.637621526518
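For what it's worth, since cross_val_score (as far as I can tell) builds a StratifiedKFold internally when given an integer cv and a classifier, the splits themselves should be identical in both procedures. A quick sanity check along these lines (reusing inputs and targets from the script above, and assuming check_cv behaves as I think it does) should confirm that the folds match:

import numpy as np
from sklearn.cross_validation import StratifiedKFold, check_cv

# For a classifier, cv=3 should expand to StratifiedKFold(targets, 3)
cv = check_cv(3, inputs, targets, classifier=True)
for (train_a, test_a), (train_b, test_b) in zip(cv, StratifiedKFold(targets, 3)):
    assert np.array_equal(test_a, test_b)

So if the folds really are the same, the difference presumably lies in how each fold is scored.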
I have no idea why these should be different. Can anyone help with this?
Tim