Dear All, I have been using the randomForest package for a couple of difficult prediction problems (which also share p >> n). The performance is good, but since all the variables in the data set are used, interpreting what is going on is not easy, even after looking at the variable importance measures produced by the randomForest run.
I have tried a simple "variable selection" scheme, and it does seem to perform well (as judged by leave-one-out cross-validation), but I am not sure it makes any sense. The idea is, in a kind of backwards elimination, to eliminate one by one the variables with smallest importance (or all those with negative importance in one go) until the out-of-bag estimate of the classification error becomes larger than that of the previous model (or of the initial model). So nothing really new. But I haven't been able to find any comments in the literature about "simplification" of random forests. Any suggestions/comments?

Best, Ramón

--
Ramón Díaz-Uriarte
Bioinformatics Unit
Centro Nacional de Investigaciones Oncológicas (CNIO)
(Spanish National Cancer Center)
Melchor Fernández Almagro, 3
28029 Madrid (Spain)
Fax: +34-91-224-6972
Phone: +34-91-224-6900
http://bioinfo.cnio.es/~rdiaz
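P.S. For concreteness, a minimal R sketch of the scheme (not my exact code; the function name rf.backward, the ntree value, and the choice of comparing against the previous model's OOB error rather than the initial model's are just one way to set it up):

    library(randomForest)

    ## Backwards elimination on a randomForest classification fit,
    ## stopping when the OOB error estimate exceeds that of the
    ## previous model. 'x': data frame of predictors; 'y': factor.
    rf.backward <- function(x, y, ntree = 2000) {
      rf <- randomForest(x, y, ntree = ntree, importance = TRUE)
      prev.err <- rf$err.rate[ntree, "OOB"]
      vars <- colnames(x)
      repeat {
        imp <- importance(rf, type = 1)  ## mean decrease in accuracy
        ## drop all variables with negative importance in one go,
        ## otherwise just the single least important one
        drop <- rownames(imp)[imp[, 1] < 0]
        if (length(drop) == 0)
          drop <- rownames(imp)[which.min(imp[, 1])]
        vars.new <- setdiff(vars, drop)
        if (length(vars.new) < 2) break
        rf.new <- randomForest(x[, vars.new, drop = FALSE], y,
                               ntree = ntree, importance = TRUE)
        err.new <- rf.new$err.rate[ntree, "OOB"]
        if (err.new > prev.err) break  ## OOB error got worse: stop
        vars <- vars.new; rf <- rf.new; prev.err <- err.new
      }
      list(forest = rf, variables = vars)
    }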
