Dear Joel,

Thank you for your reply. I used decision_function and it did replicate the results. But I was wondering if someone could help me with this further. For this particular dataset and with these parameters (C=1, kernel='rbf'), the classifier always outputs 0 for every sample. It even does this when I run predict on the training data that was used to build the model, which I find particularly strange, and it does so for every train/test split I have tried. This data is particularly noisy, so I am not expecting a very accurate outcome, but it does have an evenly matched pair of (binary) classes. The classifier does, however, do a better job with other parameters.
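[Archive note: a minimal diagnostic sketch for the all-one-class symptom described above, on synthetic data rather than the GWAS set. It checks the prediction counts and the spread of the decision function, then retries with standardised features. One common cause of the Python/R discrepancy: R's e1071 svm() centres and scales inputs by default, while scikit-learn's SVC does not.]

```python
# Diagnostic sketch on synthetic data (not Tim's GWAS set): when an RBF SVC
# predicts only one class, inspect the prediction counts and the spread of
# the decision function, then retry with standardised features.
# (R's e1071 svm() scales inputs by default; scikit-learn's SVC does not.)
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=25, random_state=0)

svc = SVC(C=1, kernel='rbf').fit(X, y)
pred = svc.predict(X)
scores = svc.decision_function(X)

# How many of each class is the model actually predicting?
print(np.unique(pred, return_counts=True))
# A very narrow decision_function range suggests the model has collapsed
# to one side of the boundary.
print(scores.min(), scores.max())

# Retry on standardised features -- a frequent fix for the all-0s problem.
X_std = StandardScaler().fit_transform(X)
pred_std = SVC(C=1, kernel='rbf').fit(X_std, y).predict(X_std)
print(np.unique(pred_std, return_counts=True))
```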
I have used this same dataset and these parameters in R's implementation of an SVM, and it does not output all 0s, so I don't think the problem lies with the data itself. I appreciate that in this situation I should just try different parameters, but I would like to understand what is going on. And I would much prefer to work in Python, as I prefer the syntax and it is far quicker; my problem is that all of my colleagues work with R, so I need help when things go wrong in Python.

Tim

> ROC AUC doesn't use binary predictions as its input; it uses the measure of
> confidence (or "decision function") that each sample should be assigned 1.
> cross_val_score is correctly using decision_function to get these
> continuous values, and you should find its results replicated by using
> roc_auc_score(y_test, svc.decision_function(X_test)) rather than the
> version with predict.
>
> Cheers,
> Joel
>
> On 19 January 2015 at 21:45, Timothy Vivian-Griffiths <
> vivian-griffith...@cardiff.ac.uk> wrote:
>
>> Just in case this does appear twice, I am sending it for the second time,
>> as I have not seen it appear in the website archives, nor has it featured
>> in the latest mail that I have received from this mailing list.
>>
>> This really follows on from the recent problems I have been having trying
>> to build an SVM classifier that uses genetic data to predict a binary
>> phenotype outcome. The controls were labelled 0 and the cases 1, and the
>> metric used to assess performance was the roc_auc_score function from
>> sklearn.metrics. The main aim of this exercise was to assess the
>> difference in performance across the 'rbf' and 'linear' kernels and two
>> very different values of C: 1 and 1000.
>>
>> Andy advised me to use the cross_val_predict function from
>> sklearn.cross_validation and compare the outcome with a permutation
>> procedure that I was carrying out using train_test_split.
>> This did lead to very different answers, and there was one model where the
>> differences were particularly noticeable: C=1 with kernel='rbf'. When
>> manually splitting the data, the SVC always predicted controls (all 0s),
>> and the roc score was therefore 0.5 for every different split. However,
>> cross_val_score (with scoring='roc_auc') gave different answers (around
>> 0.63, which is what I was expecting), so some of the predictions must
>> have been cases.
>>
>> I read that cross_val_score uses StratifiedKFold to make the splits, so I
>> decided to compare performance using cross_val_score against a manual
>> procedure using StratifiedKFold. Here is the code; for interest, the
>> inputs matrix had shape (7763, 125) and the target vector (7763,):
>>
>>
>> # Filename: compare_roc_scores.py
>>
>> """
>> A quick script to compare the performance of the inbuilt cross_val_score
>> function and manually carrying out the k-fold cross validation using
>> StratifiedKFold
>> """
>>
>> import numpy as np
>> from sklearn.svm import SVC
>> from sklearn.cross_validation import cross_val_score, StratifiedKFold
>> from sklearn.metrics import roc_auc_score
>>
>> # Loading in the sample and feature matrix, and the target phenotypes
>> inputs = np.load('stored_data_2015/125_GWAS/125_GWAS_combo_LOR_weighted_nan_removed_probabilistic_imputation.npy')
>> targets = np.load('stored_data_2015/125_GWAS/125_GWAS_combo_phenotype.npy')
>>
>> # Setting up the classifier
>> svc = SVC(C=1, kernel='rbf')
>>
>> # First carrying out 'manual' 3-fold stratified cross validation
>> stf = StratifiedKFold(targets, 3)
>>
>> manual_scores = []
>>
>> for train, test in stf:
>>     X_train, X_test = inputs[train], inputs[test]
>>     y_train, y_test = targets[train], targets[test]
>>     svc.fit(X_train, y_train)
>>     predicted = svc.predict(X_test)
>>     score = roc_auc_score(y_test, predicted)
>>     manual_scores.append(score)
>>
>> # Now carrying out the same procedure using 'cross_val_score'
>> new_svc = SVC(C=1, kernel='rbf')
>> scores = cross_val_score(new_svc, inputs, targets, cv=3, scoring='roc_auc')
>>
>> print 'Manual Scores:',
>> for s in manual_scores:
>>     print s,
>>
>> print
>>
>> print 'Scores:',
>> for s in scores:
>>     print s,
>>
>>
>> And the output of this script is always:
>>
>> Manual Scores: 0.5 0.5 0.5
>> Scores: 0.62113080937 0.625148733948 0.637621526518
>>
>> I have no idea why these should be different. Can anyone help with this?
>>
>> Tim
>>
>> ------------------------------------------------------------------------------
>> New Year. New Location. New Benefits. New Data Center in Ashburn, VA.
>> GigeNET is offering a free month of service with a new server in Ashburn.
>> Choose from 2 high performing configs, both with 100TB of bandwidth.
>> Higher redundancy. Lower latency. Increased capacity. Completely compliant.
>> http://p.sf.net/sfu/gigenet
>> _______________________________________________
>> Scikit-learn-general mailing list
>> Scikit-learn-general@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
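[Archive note: Joel's explanation can be checked on synthetic data (not the GWAS set); the sketch below uses the current sklearn.model_selection API rather than the sklearn.cross_validation module from the thread. Manual StratifiedKFold folds scored with decision_function reproduce cross_val_score(scoring='roc_auc') fold for fold, while feeding hard predict() labels into roc_auc_score gives a different, threshold-dependent number.]

```python
# Sketch on synthetic data: cross_val_score's 'roc_auc' scorer feeds SVC's
# continuous decision_function into roc_auc_score, so manual folds scored
# the same way match it exactly; hard 0/1 labels from predict() do not.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=25, random_state=0)
skf = StratifiedKFold(n_splits=3)  # shuffle=False, so folds are deterministic

# Manual folds, scored with the continuous decision_function ...
manual = []
for train, test in skf.split(X, y):
    svc = SVC(C=1, kernel='rbf').fit(X[train], y[train])
    manual.append(roc_auc_score(y[test], svc.decision_function(X[test])))

# ... match cross_val_score's 'roc_auc' scorer on the same folds.
auto = cross_val_score(SVC(C=1, kernel='rbf'), X, y, cv=skf, scoring='roc_auc')
print(manual)
print(auto)

# Scoring hard labels instead (here, on the last fold) generally gives a
# different number -- this is the manual-vs-cross_val_score gap in the thread.
label_auc = roc_auc_score(y[test], svc.predict(X[test]))
print(label_auc)
```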