> I am using the randomForest package to do some prediction on GWAS data. I
> first split the data into training and testing sets (70% vs. 30%), then
> used the training set to grow the trees (ntree = 100000). The OOB error
> on the training set looks good (<10%), but performance on the test set is
> not: the AUC is only about 50%.
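For concreteness, the workflow described above looks roughly like this (a minimal, untested sketch on simulated stand-in data; pROC, which is not mentioned in the post, is assumed here for the AUC, and a smaller ntree is used than the post's 100000):

library(randomForest)
library(pROC)   # assumed for the AUC calculation

set.seed(1)
## toy stand-in for the GWAS data: x holds the predictors, y the class label
n <- 200; p <- 50
x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("snp", 1:p)))
y <- factor(sample(c("case", "control"), n, replace = TRUE))

## 70/30 train/test split, as described
inTrain <- sample(n, floor(0.7 * n))
fit <- randomForest(x[inTrain, ], y[inTrain], ntree = 1000)

## OOB error rate on the training set (last row of the error matrix)
fit$err.rate[fit$ntree, "OOB"]

## AUC on the held-out 30%
testProb <- predict(fit, x[-inTrain, ], type = "prob")[, "case"]
auc(roc(y[-inTrain], testProb))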
Did you do any feature selection on the training set? If so, that step also needs to be included in the cross-validation to get realistic performance estimates; see Ambroise, C. and McLachlan, G. J. (2002). Selection bias in gene extraction on the basis of microarray gene-expression data. Proceedings of the National Academy of Sciences, 99(10), 6562-6566.

In the caret package, train() can be used to get cross-validation estimates for RF, and the sbf() function (selection by filter) can be used to include simple univariate filters in the CV procedure. A rough sketch of both is in the P.S. below.

> Although some people said no cross-validation was necessary for RF, I
> still felt unsafe and thought a testing set was important. I feel really
> frustrated with the results.

CV is needed whenever you want an assessment of performance on a test set. In this sense, RF is like any other method.

-- Max
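P.S. A rough, untested sketch of both suggestions, on the same kind of simulated stand-in data as above (assumes caret is installed, plus pROC for twoClassSummary):

library(caret)   # train() and sbf()

set.seed(2)
## toy two-class data standing in for the GWAS matrix
n <- 200; p <- 50
x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("snp", 1:p)))
y <- factor(sample(c("case", "control"), n, replace = TRUE))

## cross-validated performance for RF (tunes mtry, reports the ROC AUC)
ctrl <- trainControl(method = "cv", number = 10, classProbs = TRUE,
                     summaryFunction = twoClassSummary)
rfFit <- train(x, y, method = "rf", metric = "ROC", trControl = ctrl)

## selection by filter: the univariate screen is re-run inside every
## resample, so the performance estimate accounts for the selection step
filterCtrl <- sbfControl(functions = rfSBF, method = "cv", number = 10)
rfFilt <- sbf(x, y, sbfControl = filterCtrl)
rfFilt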