Apologies if this does appear twice: I am sending it a second time, as I have not seen it in the website archives, nor has it featured in the latest mail I received from this list.
This follows on from the recent problems I have been having building an SVM classifier that uses genetic data to predict a binary phenotype. The controls were labelled 0 and the cases 1, and performance was assessed with the roc_auc_score function from sklearn.metrics. The main aim of the exercise was to compare performance across the 'rbf' and 'linear' kernels and two very different values of C: 1 and 1000.
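For concreteness, the four configurations being compared look like this (just a sketch, with my own loop shorthand):

from sklearn.svm import SVC

# The 2 kernels x 2 values of C under comparison
for kernel in ('rbf', 'linear'):
    for C in (1, 1000):
        svc = SVC(C=C, kernel=kernel)
        # ... fit and score svc as in the script below ...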
Andy advised me to use the cross_val_predict function from sklearn.cross_validation and to compare the outcome with a permutation procedure I was carrying out using train_test_split. This did lead to very different answers, and there was one model where the difference was particularly noticeable: C=1 with kernel='rbf'. When splitting the data manually, the SVC always predicted controls (all 0s), so the ROC score was 0.5 for every split. However, cross_val_score (with scoring='roc_auc') gave different answers (around 0.63, which is what I was expecting)… so some of the predictions must have been cases.
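To illustrate the 0.5 scores: when every prediction is the same class there is nothing for roc_auc_score to rank, so it returns 0.5 by construction. A tiny example with made-up labels:

import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 1, 1])   # made-up labels
y_pred = np.zeros_like(y_true)       # a classifier that always predicts controls
print roc_auc_score(y_true, y_pred)  # 0.5 - constant predictions carry no ranking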
I read that cross_val_score uses StratifiedKFold to make the splits, so I decided to compare cross_val_score against a manual procedure built on StratifiedKFold. Here is the code; for interest, the inputs matrix had shape (7763, 125), i.e. 7763 samples by 125 features, and the target vector (7763,):
# Filename: compare_roc_scores.py
"""
A quick script to compare the performance of the inbuilt cross_val_score
function and manually carrying out the k-fold cross-validation using
StratifiedKFold.
"""
import numpy as np

from sklearn.svm import SVC
from sklearn.cross_validation import cross_val_score, StratifiedKFold
from sklearn.metrics import roc_auc_score

# Load the sample/feature matrix and the target phenotypes
inputs = np.load('stored_data_2015/125_GWAS/125_GWAS_combo_LOR_weighted_nan_removed_probabilistic_imputation.npy')
targets = np.load('stored_data_2015/125_GWAS/125_GWAS_combo_phenotype.npy')

# Set up the classifier
svc = SVC(C=1, kernel='rbf')

# First carry out 'manual' 3-fold stratified cross-validation
stf = StratifiedKFold(targets, 3)
manual_scores = []
for train, test in stf:
    X_train, X_test = inputs[train], inputs[test]
    y_train, y_test = targets[train], targets[test]
    svc.fit(X_train, y_train)
    # Score the hard 0/1 class predictions on the held-out fold
    predicted = svc.predict(X_test)
    score = roc_auc_score(y_test, predicted)
    manual_scores.append(score)

# Now carry out the same procedure using cross_val_score
new_svc = SVC(C=1, kernel='rbf')
scores = cross_val_score(new_svc, inputs, targets, cv=3, scoring='roc_auc')

print 'Manual Scores:', ' '.join(str(s) for s in manual_scores)
print 'Scores:', ' '.join(str(s) for s in scores)
And the output of this script is always:
Manual Scores: 0.5 0.5 0.5
Scores: 0.62113080937 0.625148733948 0.637621526518
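For what it's worth, since cross_val_score (as far as I can tell) builds a StratifiedKFold internally when given an integer cv and a classifier, the splits themselves should be identical in both procedures. A quick sanity check along these lines (reusing inputs and targets from the script above, and assuming check_cv behaves as I think it does) should confirm that the folds match:

import numpy as np
from sklearn.cross_validation import StratifiedKFold, check_cv

# For a classifier, cv=3 should expand to StratifiedKFold(targets, 3)
cv = check_cv(3, inputs, targets, classifier=True)
for (train_a, test_a), (train_b, test_b) in zip(cv, StratifiedKFold(targets, 3)):
    assert np.array_equal(test_a, test_b)

So if the folds really are the same, the difference presumably lies in how each fold is scored.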
I have no idea why these should be different. Can anyone help with this?
Tim