Just out of curiosity, I took the default "iris" example from the randomForest
help file. Having seen the admonition against using the formula interface for
large data sets, I wanted to play around a bit to see how the various ways of
calling the function affected the output. I found something interesting that I
couldn't find documented...

Just like the example...
> library(randomForest)  # loading the package so the calls below are self-contained
> set.seed(12)  # to be sure I have reproducibility
> form.rf <- randomForest(Species ~ ., data = iris)
> form.rf

Call:
 randomForest(formula = Species ~ ., data = iris)
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 4.67%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          4        46        0.08

> long.rf <- randomForest(x = iris[, 1:4], y = iris[, 5])
> long.rf
Call:
 randomForest(x = iris[, 1:4], y = iris[, 5])
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 4%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          3        47        0.06


(Now, if I had non-contiguous columns for predictors, I'd have to call it
this way....)

> long2.rf <- randomForest(x = iris[, c(1:4)], y = iris[, 5])
> long2.rf

Call:
 randomForest(x = iris[, c(1:4)], y = iris[, 5])
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 5.33%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          5        45        0.10


Any idea why these two should give different results? I can only figure that
the seed, even though it's set, somehow gets altered by the use of a
formula....
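
(One check that occurs to me, sketched here rather than something I've chased down, with the object names below just placeholders: re-set the seed immediately before each call, so every fit starts from the same RNG state, and see whether the two interfaces still disagree.)

> # re-seed right before each fit so both start from the same RNG state
> set.seed(12)
> form.chk <- randomForest(Species ~ ., data = iris)
> set.seed(12)
> long.chk <- randomForest(x = iris[, 1:4], y = iris[, 5])
> # if these two now report the same OOB error and confusion matrix, the
> # differences above came from the RNG state advancing between calls
> # rather than from the formula interface itself
> form.chk
> long.chk
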
> long3.rf <- randomForest(x = iris[, c(1, 2, 3, 4)], y = iris[, 5])
> long3.rf

Call:
 randomForest(x = iris[, c(1, 2, 3, 4)], y = iris[, 5])
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 4.67%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          4        46        0.08


Either that, or I'm calling it wrong in the long-form examples, or else there's
a bug. Not a life-threatening situation, but I am curious about the mechanics
of this. I use that sort of column indexing all the time and it seems to work
fine, yet here I get different results (form.rf vs. long.rf or long2.rf) or
identical ones (form.rf vs. long3.rf) depending on how I call the function.
Any insights?
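
(For what it's worth, here is a quick sanity check I'd run to rule out the indexing itself, a sketch rather than something I've verified exhaustively: compare the data frames that the different column selections actually produce.)

> # the three subsetting idioms should hand randomForest the same predictors;
> # I'd expect both comparisons to return TRUE
> identical(iris[, 1:4], iris[, c(1:4)])
> identical(iris[, 1:4], iris[, c(1, 2, 3, 4)])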


-- 
---------------------------------------
David L. Van Brunt, Ph.D.
mailto:[EMAIL PROTECTED]

"If Tyranny and Oppression come to this land, it will be in the guise of
fighting a foreign enemy."
--James Madison

