I've been experimenting with the randomForest R package (v. 4.6-12) and am
getting an unexpected difference between rpart and randomForest results that
may have something to do with using x's that are factors.
The same model (see code below) is used to predict a 2-value variable called
"resp" that is treated as a factor. Four of the five x's are entered as factors.
Averaged over the full dataset, the rpart predicted probabilities match
mean(resp). This seems OK.
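The rpart behaviour can be checked on a small example. The sketch below uses the kyphosis data that ships with rpart rather than nhis, and the object names (fit, p, avg_pred, class_prop) are only illustrative. It assumes the default priors and a dataset with no missing values.

```r
library(rpart)  # kyphosis ships with rpart

# Fit a classification tree with default settings (assumed here for illustration).
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# In-sample class probabilities: each row receives the class proportions of its leaf.
p <- predict(fit, newdata = kyphosis, type = "prob")

avg_pred   <- colMeans(p)
class_prop <- table(kyphosis$Kyphosis) / nrow(kyphosis)

# Summing leaf proportions over the training rows recovers the class counts,
# so the averaged predictions should agree with the observed proportions.
print(avg_pred)
print(class_prop)
print(all.equal(unname(avg_pred), as.numeric(class_prop)))
```

Each leaf's predicted probability is the class share among the training rows in that leaf, so weighting the leaves by their sizes gives back the overall class share. That is why the rpart average is expected to match.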
The average of the randomForest predicted probabilities is quite a bit
different from mean(resp): about 0.75 versus 0.69. This seems unexpected since
random forests amount to repeatedly doing variations of what rpart does.
Has anyone seen anything like this, or can anyone see what I am doing wrong?
(I did the same comparison using the kyphosis dataset in rpart with all
continuous predictors and found consistent average predicted probabilities
between rpart and randomForest.)
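The exact calls used for the kyphosis check are not shown, so the following is only a sketch of how such a comparison might be set up. The formula, the seed, ntree = 500 and the object names (rf_kyph, rf_prob) are assumptions, and the out-of-bag line is an extra check that was not part of the comparison described above.

```r
library(rpart)         # provides the kyphosis data
library(randomForest)

set.seed(1)  # arbitrary seed, only so that repeated runs agree

# Assumed settings: package defaults apart from ntree.
rf_kyph <- randomForest(Kyphosis ~ Age + Number + Start,
                        data = kyphosis, ntree = 500)

# In-sample vote fractions (newdata is supplied, so these are not out-of-bag).
rf_prob <- predict(rf_kyph, newdata = kyphosis, type = "prob")
print(colMeans(rf_prob))

# Out-of-bag vote fractions stored on the fitted object, for comparison.
print(colMeans(rf_kyph$votes))

# Observed class proportions.
print(table(kyphosis$Kyphosis) / nrow(kyphosis))
```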
Here's the code ...
require(PracTools) # R package with dataset used
require(rpart)
require(randomForest)
data(nhis) # dataset in PracTools
table(nhis$resp)/nrow(nhis)
#         0         1
# 0.3098952 0.6901048
t1 <- rpart(resp ~ age + as.factor(hisp) + as.factor(race) +
              as.factor(parents_r) + as.factor(educ_r),
            method = "class",
            control = rpart.control(minbucket = 50, cp = 0),
            data = nhis)
rpart.prob <- predict(object = t1, newdata = nhis, type = "prob")
apply(rpart.prob,2,mean)
#         0         1
# 0.3098952 0.6901048
# mean of rpart predictions same as mean(resp)
# cycled through mtry = 1,...,5; the lower mtry is, the
# worse are the predicted probs
rf.nhis <- randomForest(as.factor(resp) ~ age + as.factor(hisp) +
                          as.factor(race) + as.factor(parents_r) +
                          as.factor(educ_r),
                        importance = TRUE, na.action = na.omit, mtry = 5,
                        ntree = 1000, classwt = c(0.31, 0.69),
                        data = nhis)
rfnhis.prob <- predict(object = rf.nhis, newdata = nhis, type = "prob")
apply(rfnhis.prob,2,mean)
#         0         1
# 0.2485541 0.7514459
# not too close to mean(resp)
R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
randomForest_4.6-12
Thanks for any help,
Richard Valliant
Universities of Maryland and Michigan