> I am using the randomForest package to do some prediction on GWAS data. I
> first split the data into training and testing sets (70% vs. 30%), then
> used the training set to grow the trees (ntree = 100000). The OOB error
> on the training set looks good (<10%), but performance on the test set is
> not: the AUC is only about 50%.
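For concreteness, the workflow described above looks roughly like this (a minimal, untested sketch on simulated stand-in data; pROC, which is not mentioned in the post, is assumed here for the AUC, and a smaller ntree is used than the post's 100000):

library(randomForest)
library(pROC)   # assumed for the AUC calculation

set.seed(1)
## toy stand-in for the GWAS data: x holds the predictors, y the class label
n <- 200; p <- 50
x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("snp", 1:p)))
y <- factor(sample(c("case", "control"), n, replace = TRUE))

## 70/30 train/test split, as described
inTrain <- sample(n, floor(0.7 * n))
fit <- randomForest(x[inTrain, ], y[inTrain], ntree = 1000)

## OOB error rate on the training set (last row of the error matrix)
fit$err.rate[fit$ntree, "OOB"]

## AUC on the held-out 30%
testProb <- predict(fit, x[-inTrain, ], type = "prob")[, "case"]
auc(roc(y[-inTrain], testProb))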
Did you do any feature selection on the training set? If so, that step also needs to be included in the cross-validation to get realistic performance estimates; see Ambroise, C. and McLachlan, G. J. (2002). Selection bias in gene extraction on the basis of microarray gene-expression data. Proceedings of the National Academy of Sciences, 99(10), 6562-6566.

In the caret package, train() can be used to get cross-validation estimates for RF, and the sbf() function (selection by filter) can be used to include simple univariate filters in the CV procedure. A rough sketch of both is in the P.S. below.

> Although some people said no cross-validation was necessary for RF, I
> still felt unsafe and thought a testing set was important. I feel really
> frustrated with the results.

CV is needed whenever you want an assessment of performance on a test set. In this sense, RF is like any other method.

-- Max
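P.S. A rough, untested sketch of both suggestions, on the same kind of simulated stand-in data as above (assumes caret is installed, plus pROC for twoClassSummary):

library(caret)   # train() and sbf()

set.seed(2)
## toy two-class data standing in for the GWAS matrix
n <- 200; p <- 50
x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("snp", 1:p)))
y <- factor(sample(c("case", "control"), n, replace = TRUE))

## cross-validated performance for RF (tunes mtry, reports the ROC AUC)
ctrl <- trainControl(method = "cv", number = 10, classProbs = TRUE,
                     summaryFunction = twoClassSummary)
rfFit <- train(x, y, method = "rf", metric = "ROC", trControl = ctrl)

## selection by filter: the univariate screen is re-run inside every
## resample, so the performance estimate accounts for the selection step
filterCtrl <- sbfControl(functions = rfSBF, method = "cv", number = 10)
rfFilt <- sbf(x, y, sbfControl = filterCtrl)
rfFilt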