I think the fundamental problem is that you are using the default value of ntree (500). You should use at least 1,500 trees, and more if n or p is large.
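For example, raising ntree is just a single argument to randomForest() (a minimal sketch using the built-in iris data rather than the poster's data):

```r
library(randomForest)

data(iris)
set.seed(1)
# Grow 1,500 trees instead of the default 500; more trees stabilize
# the permutation-based importance estimates.
rf <- randomForest(Species ~ ., data = iris, ntree = 1500, importance = TRUE)
rf$ntree
```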
Also, this link will give you more up-to-date information on that package and on feature selection: http://caret.r-forge.r-project.org/featureSelection.html

Max

On Tue, Jan 28, 2014 at 5:32 PM, Dimitri Liakhovitski <dimitri.liakhovit...@gmail.com> wrote:

Here is a great response I got from SO:

There is an important difference between the two importance measures: MeanDecreaseAccuracy is calculated using out-of-bag (OOB) data; MeanDecreaseGini is not. For each tree, MeanDecreaseAccuracy is calculated on observations not used to build that particular tree. In contrast, MeanDecreaseGini is a summary of how impure the leaf nodes of a tree are, and it is calculated on the same data used to fit the trees.

When you bootstrap data, you create multiple copies of the same observations. The same observation can therefore be split into two copies: one used to build a tree, and one treated as OOB and used to calculate the accuracy measure. So data that randomForest thinks is OOB for MeanDecreaseAccuracy is not necessarily truly out of bag in your bootstrap sample, which makes the bootstrap estimate of MeanDecreaseAccuracy overly optimistic. The Gini index is immune to this, because it does not rely on evaluating importance on observations different from those used to fit the model.

I suspect what you are trying to do is use the bootstrap to generate inference (p-values / confidence intervals) indicating which variables are "important" in the sense that they are actually predictive of your outcome. The bootstrap is not appropriate in this context, because Random Forests expects that OOB data is truly OOB, and this is important for building the forest in the first place. In general, the bootstrap is not universally applicable; it is only useful in cases where it can be shown that the parameter you are estimating has nice asymptotic properties and is not sensitive to "ties" in the data.
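The duplication problem described above is easy to see in base R: a bootstrap resample of n rows draws only about 63% of the distinct rows, so roughly a third of the draws are repeats of rows already in the sample. Any repeated row can end up both inside a tree's training draw and in the set randomForest treats as that tree's OOB data. A minimal sketch:

```r
# One bootstrap resample of n = 1000 row indices, drawn with replacement.
set.seed(42)
n <- 1000
idx <- sample(n, n, replace = TRUE)

prop_duplicated <- mean(duplicated(idx))   # share of draws repeating an earlier draw
prop_unique <- length(unique(idx)) / n     # distinct rows covered, ~0.632 on average
```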
A procedure like Random Forest, which relies on the availability of OOB data, is necessarily sensitive to ties.

You may want to look at the caret package in R, which runs random forest (or one of a set of many other algorithms) inside a cross-validation loop to determine which variables are consistently important. See:

http://cran.open-source-solution.org/web/packages/caret/vignettes/caretSelection.pdf

On Tue, Jan 28, 2014 at 8:54 AM, Dimitri Liakhovitski <dimitri.liakhovit...@gmail.com> wrote:

Thank you, Bert. I'll definitely ask there. In the meantime I just wanted to make sure that my R code (my function for the bootstrap and the bootstrap run itself) is correct, and that my abnormal bootstrap results are not a product of erroneous code. Thank you!

On Mon, Jan 27, 2014 at 7:09 PM, Bert Gunter <gunter.ber...@gene.com> wrote:

I **think** this kind of methodological issue might be better suited to SO (stats.stackexchange.com). It's not really about R programming, which is the main focus of this list. And yes, I know they do intersect. Nevertheless...

Cheers,
Bert

Bert Gunter
Genentech Nonclinical Biostatistics
(650) 467-7374

"Data is not information. Information is not knowledge. And knowledge is certainly not wisdom."
-- H. Gilbert Welch

On Mon, Jan 27, 2014 at 3:47 PM, Dimitri Liakhovitski <dimitri.liakhovit...@gmail.com> wrote:

Hello!
Below, I:
1. Create a data set with a bunch of factors. All of them are predictors, and 'y' is the dependent variable.
2. Run a classification Random Forests model with predictor importance, and look at two measures of importance: MeanDecreaseAccuracy and MeanDecreaseGini.
3. Run two bootstrap runs, one for each of the two Random Forests importance measures mentioned above.
Question: Could anyone please explain why I am getting such a huge positive bias across the board (for all predictors) for MeanDecreaseAccuracy?

Thanks a lot!
Dimitri

#----------------------------------------------------------------
# Creating a data set:
#----------------------------------------------------------------

N <- 1000
myset1 <- c(1,2,3,4,5)
probs1a <- c(.05,.10,.15,.40,.30)
probs1b <- c(.05,.15,.10,.30,.40)
probs1c <- c(.05,.05,.10,.15,.65)
myset2 <- c(1,2,3,4,5,6,7)
probs2a <- c(.02,.03,.10,.15,.20,.30,.20)
probs2b <- c(.02,.03,.10,.15,.20,.20,.30)
probs2c <- c(.02,.03,.10,.10,.10,.25,.40)
myset.y <- c(1,2)
probs.y <- c(.65,.30)  # note: sums to .95; sample() rescales the weights

set.seed(1)
y <- as.factor(sample(myset.y, N, replace = TRUE, probs.y))
set.seed(2)
a <- as.factor(sample(myset1, N, replace = TRUE, probs1a))
set.seed(3)
b <- as.factor(sample(myset1, N, replace = TRUE, probs1b))
set.seed(4)
c <- as.factor(sample(myset1, N, replace = TRUE, probs1c))
set.seed(5)
d <- as.factor(sample(myset2, N, replace = TRUE, probs2a))
set.seed(6)
e <- as.factor(sample(myset2, N, replace = TRUE, probs2b))
set.seed(7)
f <- as.factor(sample(myset2, N, replace = TRUE, probs2c))

mydata <- data.frame(a, b, c, d, e, f, y)

#-------------------------------------------------------------
# Single Random Forests run with predictor importance.
#-------------------------------------------------------------

library(randomForest)
set.seed(123)
rf1 <- randomForest(y ~ ., data = mydata, importance = TRUE)
importance(rf1)[, 3:4]

#-------------------------------------------------------------
# Bootstrapping run
#-------------------------------------------------------------

library(boot)

### Defining two functions to be used for bootstrapping:

# myrf3 returns MeanDecreaseAccuracy:
myrf3 <- function(usedata, idx) {
  set.seed(123)  # note: this resets the RNG inside every bootstrap replicate
  out <- randomForest(y ~ ., data = usedata[idx, ], importance = TRUE)
  return(importance(out)[, 3])
}

# myrf4 returns MeanDecreaseGini:
myrf4 <- function(usedata, idx) {
  set.seed(123)
  out <- randomForest(y ~ ., data = usedata[idx, ], importance = TRUE)
  return(importance(out)[, 4])
}

### 2 bootstrap runs:
rfboot3 <- boot(mydata, myrf3, R = 10)
rfboot4 <- boot(mydata, myrf4, R = 10)

### Results
rfboot3  # for MeanDecreaseAccuracy
colMeans(rfboot3$t) - importance(rf1)[, 3]

rfboot4  # for MeanDecreaseGini
colMeans(rfboot4$t) - importance(rf1)[, 4]

--
Dimitri Liakhovitski

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
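The cross-validation approach suggested in the thread can also be sketched without caret: refit the forest on each training fold and check which variables rank highly in every fold. A minimal sketch, using the built-in iris data for illustration (the fold count and ranking rule are arbitrary choices, not anything prescribed by randomForest):

```r
library(randomForest)

data(iris)
set.seed(1)
k <- 5
# Assign each row to one of k cross-validation folds.
folds <- sample(rep(1:k, length.out = nrow(iris)))

# For each fold, fit on the other k-1 folds and rank the predictors
# by permutation importance (rank 1 = most important).
ranks <- sapply(1:k, function(f) {
  fit <- randomForest(Species ~ ., data = iris[folds != f, ], importance = TRUE)
  rank(-importance(fit)[, "MeanDecreaseAccuracy"])
})

rowMeans(ranks)  # predictors with consistently low mean rank matter
```

Unlike the bootstrap, each held-out fold here is genuinely disjoint from the data used to grow the trees, so the OOB mechanism inside each fit is not contaminated by duplicated rows.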