Ravishankar,

> I used Random Forest with a couple of data sets I had to predict a
> binary response. In all the cases, the AUC on the training set comes
> out to be 1. Is this always the case with random forests? Can someone
> please clarify this?
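The behavior is easy to reproduce in a few lines. A sketch, assuming a two-class subset of iris (as the 100-sample output later in this thread suggests the poster used) and the randomForest and ROCR packages:

```r
library(randomForest)
library(ROCR)

# Hypothetical setup: drop one species to get a binary problem
iris2 <- droplevels(subset(iris, Species != "setosa"))

set.seed(1)
rf <- randomForest(Species ~ ., data = iris2)

# Re-predict the SAME data the forest was trained on: each tree has
# effectively memorized its bootstrap sample, so the resubstitution
# probabilities separate the classes almost perfectly
p <- predict(rf, iris2, type = "prob")[, 2]
performance(prediction(p, iris2$Species), "auc")@y.values[[1]]
# resubstitution AUC is at or very near 1 -- over-optimistic
```

Note that the key step is passing the training data back to `predict()`; that routes every row through all trees, including the ones that were fit on it.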
This is pretty typical for this model.

> I have given a simple example, first using logistic regression and then
> using random forests to explain the problem. AUC of the random forest is
> coming out to be 1.

Logistic regression isn't as flexible as RF and some other methods, so the
apparent AUC is likely to be less than one, but still much higher than it
really is (since you are re-predicting the same data). For your example:

> performance(prediction(train.predict, iris$Species), "auc")@y.values[[1]]
[1] 0.9972

but using simple 10-fold CV:

> library(caret)
> ctrl <- trainControl(method = "cv",
+                      classProbs = TRUE,
+                      summaryFunction = twoClassSummary)
>
> set.seed(1)
> cvEstimate <- train(Species ~ ., data = iris,
+                     method = "glm",
+                     metric = "ROC",
+                     trControl = ctrl)
Fitting: parameter=none
Aggregating results
Fitting model on full training set
Warning messages:
1: glm.fit: fitted probabilities numerically 0 or 1 occurred
2: glm.fit: algorithm did not converge
3: glm.fit: fitted probabilities numerically 0 or 1 occurred
4: glm.fit: algorithm did not converge
5: glm.fit: fitted probabilities numerically 0 or 1 occurred
> cvEstimate

Call:
train.formula(form = Species ~ ., data = iris, method = "glm",
    metric = "ROC", trControl = ctrl)

100 samples
  4 predictors

Pre-processing:
Resampling: Cross-Validation (10 fold)

Summary of sample sizes: 90, 90, 90, 90, 90, 90, ...
Resampling results

  Sens  Spec  ROC    Sens SD  Spec SD  ROC SD
  0.96  0.98  0.86   0.0843   0.0632   0.126

and for random forest:

> set.seed(1)
> rfEstimate <- train(Species ~ .,
+                     data = iris,
+                     method = "rf",
+                     metric = "ROC",
+                     tuneGrid = data.frame(.mtry = 2),
+                     trControl = ctrl)
Fitting: mtry=2
Aggregating results
Selecting tuning parameters
Fitting model on full training set
> rfEstimate

Call:
train.formula(form = Species ~ ., data = iris, method = "rf",
    metric = "ROC", tuneGrid = data.frame(.mtry = 2), trControl = ctrl)

100 samples
  4 predictors

Pre-processing:
Resampling: Cross-Validation (10 fold)

Summary of sample sizes: 90, 90, 90, 90, 90, 90, ...

Resampling results

  Sens  Spec  ROC    Sens SD  Spec SD  ROC SD
  0.94  0.92  0.898  0.0966   0.14     0.00632

Tuning parameter 'mtry' was held constant at a value of 2

--
Max

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
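P.S. As an alternative to explicit cross-validation, randomForest's own out-of-bag (OOB) votes give an honest performance estimate for free, since each row is scored only by the trees that did not see it during bagging. A sketch, assuming the same hypothetical two-class subset of iris and the ROCR package:

```r
library(randomForest)
library(ROCR)

# Hypothetical two-class version of iris (100 samples, 4 predictors)
iris2 <- droplevels(subset(iris, Species != "setosa"))

set.seed(1)
rf <- randomForest(Species ~ ., data = iris2)

# rf$votes holds OOB class probabilities: row i is averaged only over
# trees whose bootstrap sample excluded row i
oob.prob <- rf$votes[, 2]
performance(prediction(oob.prob, iris2$Species), "auc")@y.values[[1]]
# typically well below 1, in line with the cross-validated estimate
```

Calling `predict(rf)` with no new data returns these same OOB predictions, which is why the printed confusion matrix and error rate from a randomForest fit are already honest, unlike the AUC you get from re-predicting the training set.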