I think the fundamental problem is that you are using the default value of ntree (500). You should use at least 1,500 trees, and more if n or p is large.
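For example, raising ntree is just a single argument to randomForest() (a minimal sketch using the built-in iris data rather than the poster's data):

```r
library(randomForest)

data(iris)
set.seed(1)
# Grow 1,500 trees instead of the default 500; more trees stabilize
# the permutation-based importance estimates.
rf <- randomForest(Species ~ ., data = iris, ntree = 1500, importance = TRUE)
rf$ntree
```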
Also, this link will give you more up-to-date information on that package and on feature selection: http://caret.r-forge.r-project.org/featureSelection.html

Max

On Tue, Jan 28, 2014 at 5:32 PM, Dimitri Liakhovitski <dimitri.liakhovit...@gmail.com> wrote:

Here is a great response I got from SO:

There is an important difference between the two importance measures: MeanDecreaseAccuracy is calculated using out-of-bag (OOB) data; MeanDecreaseGini is not. For each tree, MeanDecreaseAccuracy is calculated on observations not used to build that particular tree. In contrast, MeanDecreaseGini is a summary of how impure the leaf nodes of a tree are, and it is calculated on the same data used to fit the trees.

When you bootstrap data, you create multiple copies of the same observations. The same observation can therefore be split into two copies: one used to build a tree, and one treated as OOB and used to calculate the accuracy measure. So data that randomForest thinks is OOB for MeanDecreaseAccuracy is not necessarily truly out of bag in your bootstrap sample, which makes the bootstrap estimate of MeanDecreaseAccuracy overly optimistic. The Gini index is immune to this, because it does not rely on evaluating importance on observations different from those used to fit the model.

I suspect what you are trying to do is use the bootstrap to generate inference (p-values / confidence intervals) indicating which variables are "important" in the sense that they are actually predictive of your outcome. The bootstrap is not appropriate in this context, because Random Forests expects that OOB data is truly OOB, and this is important for building the forest in the first place. In general, the bootstrap is not universally applicable; it is only useful in cases where it can be shown that the parameter you are estimating has nice asymptotic properties and is not sensitive to "ties" in the data.
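The duplication problem described above is easy to see in base R: a bootstrap resample of n rows draws only about 63% of the distinct rows, so roughly a third of the draws are repeats of rows already in the sample. Any repeated row can end up both inside a tree's training draw and in the set randomForest treats as that tree's OOB data. A minimal sketch:

```r
# One bootstrap resample of n = 1000 row indices, drawn with replacement.
set.seed(42)
n <- 1000
idx <- sample(n, n, replace = TRUE)

prop_duplicated <- mean(duplicated(idx))   # share of draws repeating an earlier draw
prop_unique <- length(unique(idx)) / n     # distinct rows covered, ~0.632 on average
```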
A procedure like Random Forest, which relies on the availability of OOB data, is necessarily sensitive to ties.

You may want to look at the caret package in R, which runs random forest (or one of a set of many other algorithms) inside a cross-validation loop to determine which variables are consistently important. See:

http://cran.open-source-solution.org/web/packages/caret/vignettes/caretSelection.pdf

On Tue, Jan 28, 2014 at 8:54 AM, Dimitri Liakhovitski <dimitri.liakhovit...@gmail.com> wrote:

Thank you, Bert. I'll definitely ask there. In the meantime I just wanted to make sure that my R code (my function for the bootstrap and the bootstrap run itself) is correct, and that my abnormal bootstrap results are not a product of erroneous code. Thank you!

On Mon, Jan 27, 2014 at 7:09 PM, Bert Gunter <gunter.ber...@gene.com> wrote:

I **think** this kind of methodological issue might be better suited to SO (stats.stackexchange.com). It's not really about R programming, which is the main focus of this list. And yes, I know they do intersect. Nevertheless...

Cheers,
Bert

Bert Gunter
Genentech Nonclinical Biostatistics
(650) 467-7374

"Data is not information. Information is not knowledge. And knowledge is certainly not wisdom."
-- H. Gilbert Welch

On Mon, Jan 27, 2014 at 3:47 PM, Dimitri Liakhovitski <dimitri.liakhovit...@gmail.com> wrote:

Hello!
Below, I:
1. Create a data set with a bunch of factors. All of them are predictors, and 'y' is the dependent variable.
2. Run a classification Random Forests model with predictor importance, and look at two measures of importance: MeanDecreaseAccuracy and MeanDecreaseGini.
3. Run two bootstrap runs, one for each of the two Random Forests importance measures mentioned above.
Question: Could anyone please explain why I am getting such a huge positive bias across the board (for all predictors) for MeanDecreaseAccuracy?

Thanks a lot!
Dimitri

#----------------------------------------------------------------
# Creating a data set:
#----------------------------------------------------------------

N <- 1000
myset1 <- c(1,2,3,4,5)
probs1a <- c(.05,.10,.15,.40,.30)
probs1b <- c(.05,.15,.10,.30,.40)
probs1c <- c(.05,.05,.10,.15,.65)
myset2 <- c(1,2,3,4,5,6,7)
probs2a <- c(.02,.03,.10,.15,.20,.30,.20)
probs2b <- c(.02,.03,.10,.15,.20,.20,.30)
probs2c <- c(.02,.03,.10,.10,.10,.25,.40)
myset.y <- c(1,2)
probs.y <- c(.65,.30)  # note: sums to .95; sample() rescales the weights

set.seed(1)
y <- as.factor(sample(myset.y, N, replace = TRUE, probs.y))
set.seed(2)
a <- as.factor(sample(myset1, N, replace = TRUE, probs1a))
set.seed(3)
b <- as.factor(sample(myset1, N, replace = TRUE, probs1b))
set.seed(4)
c <- as.factor(sample(myset1, N, replace = TRUE, probs1c))
set.seed(5)
d <- as.factor(sample(myset2, N, replace = TRUE, probs2a))
set.seed(6)
e <- as.factor(sample(myset2, N, replace = TRUE, probs2b))
set.seed(7)
f <- as.factor(sample(myset2, N, replace = TRUE, probs2c))

mydata <- data.frame(a, b, c, d, e, f, y)

#-------------------------------------------------------------
# Single Random Forests run with predictor importance.
#-------------------------------------------------------------

library(randomForest)
set.seed(123)
rf1 <- randomForest(y ~ ., data = mydata, importance = TRUE)
importance(rf1)[, 3:4]

#-------------------------------------------------------------
# Bootstrapping run
#-------------------------------------------------------------

library(boot)

### Defining two functions to be used for bootstrapping:

# myrf3 returns MeanDecreaseAccuracy:
myrf3 <- function(usedata, idx) {
  set.seed(123)  # note: this resets the RNG inside every bootstrap replicate
  out <- randomForest(y ~ ., data = usedata[idx, ], importance = TRUE)
  return(importance(out)[, 3])
}

# myrf4 returns MeanDecreaseGini:
myrf4 <- function(usedata, idx) {
  set.seed(123)
  out <- randomForest(y ~ ., data = usedata[idx, ], importance = TRUE)
  return(importance(out)[, 4])
}

### 2 bootstrap runs:
rfboot3 <- boot(mydata, myrf3, R = 10)
rfboot4 <- boot(mydata, myrf4, R = 10)

### Results
rfboot3  # for MeanDecreaseAccuracy
colMeans(rfboot3$t) - importance(rf1)[, 3]

rfboot4  # for MeanDecreaseGini
colMeans(rfboot4$t) - importance(rf1)[, 4]

--
Dimitri Liakhovitski

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
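The cross-validation approach suggested in the thread can also be sketched without caret: refit the forest on each training fold and check which variables rank highly in every fold. A minimal sketch, using the built-in iris data for illustration (the fold count and ranking rule are arbitrary choices, not anything prescribed by randomForest):

```r
library(randomForest)

data(iris)
set.seed(1)
k <- 5
# Assign each row to one of k cross-validation folds.
folds <- sample(rep(1:k, length.out = nrow(iris)))

# For each fold, fit on the other k-1 folds and rank the predictors
# by permutation importance (rank 1 = most important).
ranks <- sapply(1:k, function(f) {
  fit <- randomForest(Species ~ ., data = iris[folds != f, ], importance = TRUE)
  rank(-importance(fit)[, "MeanDecreaseAccuracy"])
})

rowMeans(ranks)  # predictors with consistently low mean rank matter
```

Unlike the bootstrap, each held-out fold here is genuinely disjoint from the data used to grow the trees, so the OOB mechanism inside each fit is not contaminated by duplicated rows.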