Hi Max,

Thanks very much for investigating and explaining that - your help and time are much appreciated.

So as I understand it, using classProbs=FALSE in trainControl() will give me the same accuracy results as before. However, I was relying on the class probabilities to return ROC/sensitivity/specificity, using a custom function similar to twoClassSummary().

What I still don't quite understand is which accuracy values from train() I should trust: those using classProbs=TRUE or classProbs=FALSE? I'm using train() to compare different classification methods using several stats (accuracy, AUROC etc.), but this issue means that suddenly SVM has got much worse (based on accuracy)!

I guess this means that I should roll back to the earlier versions of caret and kernlab (which is a pain because then train() often crashes with 'memory map' errors!)?

Thanks,

Andrew

On 16/11/2013, at 09:59, Max Kuhn <mxk...@gmail.com> wrote:

> Or not!
>
> The issue is with kernlab.
>
> Background: SVM models do not naturally produce class probabilities. A
> secondary model (via Platt scaling) is fit to the raw model output and a
> logistic function is used to translate the raw SVM output to
> probability-like numbers (i.e. sum to one, between 0 and 1). In
> ksvm(), you need to use the option prob.model = TRUE to get that
> second model.
>
> I discovered some time ago that there can be a discrepancy between the
> predicted classes that naturally come from the SVM model and those
> derived by using the class associated with the largest class
> probability. This is most likely due to natural error in the secondary
> probability model and should not be unexpected.
>
> That is the case for your data. If you use the same tuning parameters
> as those suggested by train() and go straight to ksvm():
>
>> newSVM <- ksvm(x = as.matrix(df[,-1]),
> +                y = df[,1],
> +                kernel = rbfdot(sigma = svm.m1$bestTune$.sigma),
> +                C = svm.m1$bestTune$.C,
> +                prob.model = TRUE)
>>
>> predict(newSVM, df[43,-1])
> [1] O32078
> 10 Levels: O27479 O31403 O32057 O32059 O32060 O32078 ... O32676
>> predict(newSVM, df[43,-1], type = "probabilities")
>          O27479     O31403    O32057    O32059     O32060    O32078
> [1,] 0.08791826 0.05911645 0.2424997 0.1036943 0.06968587 0.1648394
>          O32089     O32663     O32668     O32676
> [1,] 0.04890477 0.05210836 0.09838892 0.07284396
>
> Note that, based on the probability model, the class with the largest
> probability is O32057 (p = 0.24) while the basic SVM model predicts
> O32078 (p = 0.16).
>
> Somebody (maybe me) saw this discrepancy and that led me to follow this
> rule:
>
> if (prob.model == TRUE) use the class with the maximum probability
> else use the class prediction from ksvm().
>
> Therefore:
>
>> predict(svm.m1, df[43,-1])
> [1] O32057
> 10 Levels: O27479 O31403 O32057 O32059 O32060 O32078 ... O32676
>
> That change occurred between the two caret versions that you tested with.
>
> (On a side note, this can also occur with ksvm() and rpart() if
> cost-sensitive training is used, because the class designation takes
> into account the costs but the class probability predictions do not. I
> alerted both package maintainers to the issue some time ago.)
>
> HTH,
>
> Max
>
> On Fri, Nov 15, 2013 at 1:56 PM, Max Kuhn <mxk...@gmail.com> wrote:
>> I've looked into this a bit and the issue seems to be with caret. I've
>> been looking at the svn check-ins and nothing stands out to me as the
>> issue so far. The final models that are generated are the same and
>> I'll try to figure out the difference.
>>
>> Two small notes:
>>
>> 1) you should set the seed to ensure reproducibility.
>> 2) you really shouldn't use character strings with all numbers as
>> factor levels with caret when you want class probabilities. It should
>> give you a warning about this.
>>
>> Max
>>
>> On Thu, Nov 14, 2013 at 7:31 PM, Andrew Digby <andrewdi...@mac.com> wrote:
>>>
>>> I'm using caret to assess classifier performance (and it's great!).
>>> However, I've found that my results differ between R2.* and R3.* - reported
>>> accuracies are reduced dramatically.
>>> I suspect that a code change to kernlab ksvm may be responsible (see
>>> version 5.16-24 here:
>>> http://cran.r-project.org/web/packages/caret/news.html). I get very
>>> different results between caret_5.15-61 + kernlab_0.9-17 and
>>> caret_5.17-7 + kernlab_0.9-19 (see below).
>>>
>>> Can anyone please shed any light on this?
>>>
>>> Thanks very much!
>>>
>>>
>>> ### To replicate:
>>>
>>> require(repmis)  # For downloading from https
>>> df <- source_data('https://dl.dropboxusercontent.com/u/47973221/data.csv',
>>>                   sep=',')
>>> require(caret)
>>> svm.m1 <-
>>> train(df[,-1],df[,1],method='svmRadial',metric='Kappa',tunelength=5,trControl=trainControl(method='repeatedcv',
>>> number=10, repeats=10, classProbs=TRUE))
>>> svm.m1
>>> sessionInfo()
>>>
>>> ### Results - R2.15.2
>>>
>>>> svm.m1
>>> 1241 samples
>>>    7 predictors
>>>   10 classes: ‘O27479’, ‘O31403’, ‘O32057’, ‘O32059’, ‘O32060’, ‘O32078’,
>>> ‘O32089’, ‘O32663’, ‘O32668’, ‘O32676’
>>>
>>> No pre-processing
>>> Resampling: Cross-Validation (10 fold, repeated 10 times)
>>>
>>> Summary of sample sizes: 1116, 1116, 1114, 1118, 1118, 1119, ...
>>>
>>> Resampling results across tuning parameters:
>>>
>>>   C     Accuracy  Kappa  Accuracy SD  Kappa SD
>>>   0.25  0.684     0.63   0.0353       0.0416
>>>   0.5   0.729     0.685  0.0379       0.0445
>>>   1     0.756     0.716  0.0357       0.0418
>>>
>>> Tuning parameter ‘sigma’ was held constant at a value of 0.247
>>> Kappa was used to select the optimal model using the largest value.
>>> The final values used for the model were C = 1 and sigma = 0.247.
>>>> sessionInfo()
>>> R version 2.15.2 (2012-10-26)
>>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>>>
>>> locale:
>>> [1] en_NZ.UTF-8/en_NZ.UTF-8/en_NZ.UTF-8/C/en_NZ.UTF-8/en_NZ.UTF-8
>>>
>>> attached base packages:
>>> [1] stats graphics grDevices utils datasets methods base
>>>
>>> other attached packages:
>>> [1] e1071_1.6-1 class_7.3-5 kernlab_0.9-17 repmis_0.2.4
>>> caret_5.15-61 reshape2_1.2.2 plyr_1.8 lattice_0.20-10
>>> foreach_1.4.0 cluster_1.14.3
>>>
>>> loaded via a namespace (and not attached):
>>> [1] codetools_0.2-8 compiler_2.15.2 digest_0.6.0 evaluate_0.4.3
>>> formatR_0.7 grid_2.15.2 httr_0.2 iterators_1.0.6 knitr_1.1
>>> RCurl_1.95-4.1 stringr_0.6.2 tools_2.15.2
>>>
>>> ### Results - R3.0.2
>>>
>>>> require(caret)
>>>> svm.m1 <-
>>>> train(df[,-1],df[,1],method='svmRadial',metric='Kappa',tunelength=5,trControl=trainControl(method='repeatedcv',
>>>> number=10, repeats=10, classProbs=TRUE))
>>> Loading required package: class
>>> Warning messages:
>>> 1: closing unused connection 4
>>> (https://dl.dropboxusercontent.com/u/47973221/df.Rdata)
>>> 2: executing %dopar% sequentially: no parallel backend registered
>>>> svm.m1
>>> 1241 samples
>>>    7 predictors
>>>   10 classes: ‘O27479’, ‘O31403’, ‘O32057’, ‘O32059’, ‘O32060’, ‘O32078’,
>>> ‘O32089’, ‘O32663’, ‘O32668’, ‘O32676’
>>>
>>> No pre-processing
>>> Resampling: Cross-Validation (10 fold, repeated 10 times)
>>>
>>> Summary of sample sizes: 1118, 1117, 1115, 1117, 1116, 1118, ...
>>>
>>> Resampling results across tuning parameters:
>>>
>>>   C     Accuracy  Kappa  Accuracy SD  Kappa SD
>>>   0.25  0.372     0.278  0.033        0.0371
>>>   0.5   0.39      0.297  0.0317       0.0358
>>>   1     0.399     0.307  0.0289       0.0323
>>>
>>> Tuning parameter ‘sigma’ was held constant at a value of 0.2148907
>>> Kappa was used to select the optimal model using the largest value.
>>> The final values used for the model were C = 1 and sigma = 0.215.
>>>> sessionInfo()
>>> R version 3.0.2 (2013-09-25)
>>> Platform: x86_64-apple-darwin10.8.0 (64-bit)
>>>
>>> locale:
>>> [1] en_NZ.UTF-8/en_NZ.UTF-8/en_NZ.UTF-8/C/en_NZ.UTF-8/en_NZ.UTF-8
>>>
>>> attached base packages:
>>> [1] stats graphics grDevices utils datasets methods base
>>>
>>> other attached packages:
>>> [1] e1071_1.6-1 class_7.3-9 kernlab_0.9-19 repmis_0.2.6.2
>>> caret_5.17-7 reshape2_1.2.2 plyr_1.8 lattice_0.20-24
>>> foreach_1.4.1 cluster_1.14.4
>>>
>>> loaded via a namespace (and not attached):
>>> [1] codetools_0.2-8 compiler_3.0.2 digest_0.6.3 grid_3.0.2
>>> httr_0.2 iterators_1.0.6 RCurl_1.95-4.1 stringr_0.6.2 tools_3.0.2
>>>
>>> ______________________________________________
>>> R-help@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>
>>
>>
>> --
>>
>> Max
>
>
>
> --
>
> Max
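
For reference, the row-43 discrepancy that Max describes can be checked without refitting anything. The following is a minimal base-R sketch, not caret's or kernlab's actual code: the probabilities and the SVM's own class prediction are copied from the output quoted above, and class_from_probs() is a hypothetical helper name introduced here purely for illustration.

```r
# Probabilities for row 43, copied from the quoted
# predict(newSVM, df[43,-1], type = "probabilities") output.
probs <- matrix(
  c(0.08791826, 0.05911645, 0.2424997, 0.1036943, 0.06968587,
    0.1648394, 0.04890477, 0.05210836, 0.09838892, 0.07284396),
  nrow = 1,
  dimnames = list(NULL, c("O27479", "O31403", "O32057", "O32059", "O32060",
                          "O32078", "O32089", "O32663", "O32668", "O32676"))
)

# Class predicted directly by the SVM for the same row (from the thread).
native <- factor("O32078", levels = colnames(probs))

# Hypothetical helper (not part of caret or kernlab): for each row,
# return the class whose probability is largest.
class_from_probs <- function(p) {
  factor(colnames(p)[max.col(p, ties.method = "first")],
         levels = colnames(p))
}

from_probs <- class_from_probs(probs)

# Compare the two kinds of prediction.
print(as.character(native))        # class from the SVM itself
print(as.character(from_probs))    # class with the largest probability
print(mean(native != from_probs))  # proportion of rows where they disagree
```

With a fitted ksvm() model, native would presumably come from predict(fit, newdata) and probs from predict(fit, newdata, type = "probabilities"), the two calls shown in Max's reply. Running the same comparison over a held-out set gives the disagreement rate between the two kinds of prediction, which may help in judging how much of the accuracy drop comes from the probability model rather than from the SVM itself.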