Just out of curiosity, I took the default "iris" example from the randomForest help file. Having seen the admonition against using the formula interface for large data sets, I wanted to play around a bit to see how the different ways of calling the function affected the output. I found something interesting that I couldn't find documentation for...
Just like the example...

> set.seed(12) # to be sure I have reproducibility
> form.rf<-randomForest(Species ~ ., data=iris)
> form.rf

Call:
 randomForest(formula = Species ~ ., data = iris)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of error rate: 4.67%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          4        46        0.08

> long.rf<-randomForest(x=iris[,1:4],y=iris[,5])
> long.rf

Call:
 randomForest(x = iris[, 1:4], y = iris[, 5])
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of error rate: 4%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          3        47        0.06

(Now, if I had non-contiguous columns for predictors, I'd have to call it this way....)

> long2.rf<-randomForest(x=iris[,c(1:4)],y=iris[,5])
> long2.rf

Call:
 randomForest(x = iris[, c(1:4)], y = iris[, 5])
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of error rate: 5.33%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          5        45        0.10

Any idea why these two should give different results? I can only figure that the seed, even though it's set, somehow gets altered by the use of a formula....

> long3.rf<-randomForest(x=iris[,c(1,2,3,4)],y=iris[,5])
> long3.rf

Call:
 randomForest(x = iris[, c(1, 2, 3, 4)], y = iris[, 5])
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of error rate: 4.67%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          4        46        0.08

Either that or I'm calling it wrong in the long example, or else there's a bug. Not a life threatening situation, but I am curious as to the mechanics of this.
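One way to test the seed theory, as a rough sketch only: in the session above set.seed(12) is called once, so every later randomForest() call starts from whatever random-number state the previous fit left behind. The code below instead resets the seed immediately before each fit and then compares the confusion matrices. It assumes the randomForest package is installed; the object names are the ones used above. If the seed is the explanation, the fits should agree; if they still differ, something else is going on.

```r
library(randomForest)  # assumption: the package is installed

# Reset the seed right before each fit so every call starts from the same
# random-number state, instead of continuing from the previous fit.
set.seed(12); form.rf  <- randomForest(Species ~ ., data = iris)
set.seed(12); long.rf  <- randomForest(x = iris[, 1:4], y = iris[, 5])
set.seed(12); long2.rf <- randomForest(x = iris[, c(1:4)], y = iris[, 5])
set.seed(12); long3.rf <- randomForest(x = iris[, c(1, 2, 3, 4)], y = iris[, 5])

# Check whether the three ways of indexing columns pick out the same data.
print(identical(iris[, 1:4], iris[, c(1:4)]))
print(identical(iris[, 1:4], iris[, c(1, 2, 3, 4)]))

# Compare the confusion matrices of the x/y fits with each other ...
print(identical(long.rf$confusion, long2.rf$confusion))
print(identical(long.rf$confusion, long3.rf$confusion))

# ... and the formula fit against an x/y fit.
print(all.equal(form.rf$confusion, long.rf$confusion))
```

The same check could be applied to other components of the fitted objects (for example the err.rate matrix) if the confusion matrix alone is not convincing.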
I use that sort of column identification all the time and it seems to work OK, but here I get different results (form.rf vs. long.rf or long2.rf) or the same results (form.rf vs. long3.rf) depending on how I call the function. Any insights?

--
---------------------------------------
David L. Van Brunt, Ph.D.
mailto:[EMAIL PROTECTED]

"If Tyranny and Oppression come to this land, it will be in the guise of fighting a foreign enemy." --James Madison

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.