ROC AUC doesn't use binary predictions as its input; it uses the measure of
confidence (or "decision function") that each sample should be assigned 1.
cross_val_score is correctly using decision_function to get these
continuous values, and you should find its results replicated by using
roc_auc_score(y_test, svc.decision_function(X_test)) rather than the
version with predict. Cheers, Joel
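For concreteness, here is a minimal sketch of the difference Joel describes. The data here is synthetic stand-in data, not the GWAS matrix from the script below; only the `predict` vs `decision_function` contrast is the point:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

# Synthetic stand-in data (illustrative only)
rng = np.random.RandomState(0)
X_train = rng.randn(100, 5)
y_train = rng.randint(0, 2, 100)
X_test = rng.randn(40, 5)
y_test = rng.randint(0, 2, 40)

svc = SVC(C=1, kernel='rbf')
svc.fit(X_train, y_train)

# Hard 0/1 labels collapse the ROC curve to a single point;
# if the model predicts all zeros, this evaluates to exactly 0.5.
auc_from_labels = roc_auc_score(y_test, svc.predict(X_test))

# Continuous confidence scores trace out the full ROC curve --
# this is what scoring='roc_auc' in cross_val_score uses internally.
auc_from_scores = roc_auc_score(y_test, svc.decision_function(X_test))
```

So the manual loop should score `svc.decision_function(X_test)` rather than `svc.predict(X_test)` to match `cross_val_score`.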
On 19 January 2015 at 21:45, Timothy Vivian-Griffiths <
vivian-griffith...@cardiff.ac.uk> wrote:
> Apologies if this appears twice: I am sending it a second time as I have
> not seen it in the website archives, nor has it appeared in the latest
> mail that I have received from this mailing list.
>
> This is really following on from the recent problems that I have been
> having trying to build an SVM classifier using genetic data to predict a
> binary phenotype outcome. The controls were labelled as 0, and the cases as
> 1, and the metric used to assess the performance was the roc_auc_score
> function from sklearn.metrics. The main aim of this exercise was to assess
> the difference in performance between the ‘rbf’ and ‘linear’ kernels and two
> very different values of C: 1 and 1000.
>
> Andy advised me to use the cross_val_predict function from
> sklearn.cross_validation and compare the outcome with a permutation
> procedure that I was carrying out using train_test_split. This did lead to
> very different answers, and there was one model where the differences were
> particularly noticeable: with C=1 and kernel = ‘rbf’. When manually
> splitting the data, this was leading to the SVC always predicting controls
> (all 0s), and the roc score was therefore 0.5 for every different split.
> However, cross_val_score (with scoring = ‘roc_auc’) gave different answers
> (around 0.63 - what I was expecting)… so some of the predictions must have
> been cases.
>
> I read that cross_val_score uses StratifiedKFold to make the splits, so
> I decided to test performance using cross_val_score and a manual procedure
> using StratifiedKFold. Here is the code - for interest, the inputs matrix
> had shape (7763, 125) and the target vector (7763,):
>
>
> # Filename: compare_roc_scores.py
>
> """
> A quick script to compare the performance of the inbuilt cross_val_score
> function and manually carrying out the k-fold cross validation using
> StratifiedKFold
> """
>
> import numpy as np
> from sklearn.svm import SVC
> from sklearn.cross_validation import cross_val_score, StratifiedKFold
> from sklearn.metrics import roc_auc_score
>
> # Loading in the sample and feature matrix, and the target phenotypes
> inputs = np.load('stored_data_2015/125_GWAS/125_GWAS_combo_LOR_weighted_nan_removed_probabilistic_imputation.npy')
> targets = np.load('stored_data_2015/125_GWAS/125_GWAS_combo_phenotype.npy')
>
> # Setting up the classifier
> svc = SVC(C=1, kernel='rbf')
>
> # First carrying out 'manual' 3-Fold stratified cross validation
> stf = StratifiedKFold(targets, 3)
>
> manual_scores = []
>
> for train, test in stf:
>     X_train, X_test = inputs[train], inputs[test]
>     y_train, y_test = targets[train], targets[test]
>     svc.fit(X_train, y_train)
>     predicted = svc.predict(X_test)
>     score = roc_auc_score(y_test, predicted)
>     manual_scores.append(score)
>
> # Now carrying out the same procedure using 'cross_val_score'
> new_svc = SVC(C=1, kernel='rbf')
> scores = cross_val_score(new_svc, inputs, targets, cv=3, scoring='roc_auc')
>
> print 'Manual Scores:',
> for s in manual_scores:
>     print s,
>
> print
>
> print 'Scores:',
> for s in scores:
>     print s,
>
>
> And the output of this script is always:
>
> Manual Scores: 0.5 0.5 0.5
> Scores: 0.62113080937 0.625148733948 0.637621526518
>
> I have no idea why these should be different. Can anyone help with this?
>
> Tim
>
>
> _______________________________________________
> Scikit-learn-general mailing list
> Scikit-learn-general@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
>
>