Hi, Frank,

If I want to do prediction as well as select important predictors, which function would be best to use when I have 35 samples and 35 predictors (penalized logistic regression with variable selection)? I saw there is a 'fastbw' function in the Design package, and there is a 'step.plr' function in the 'stepPlr' package.

Thank you,
Annie

On Thu, Sep 3, 2009 at 10:11 AM, Frank E Harrell Jr <f.harr...@vanderbilt.edu> wrote:

> annie Zhang wrote:
>
>> Thank you for all your replies.
>> Actually, as Bert said, besides prediction, I also need variable selection
>> (I need to know which variables are important). As for the sample size
>> and the number of variables, both are small, around 35. How can I get
>> accurate prediction as well as good predictors?
>> Annie
>
> It is next to impossible to find a unique list of 'important' variables
> without having 50 times as many subjects as potential predictors, unless
> your signal:noise ratio is stunning.
>
> Frank
>
>> On Thu, Sep 3, 2009 at 8:28 AM, Bert Gunter <gunter.ber...@gene.com> wrote:
>>
>> But let's be clear here folks:
>>
>> Ben's comment is apropos: "'As many variables as samples' is particularly
>> scary."
>>
>> (Aside -- how much scarier then are -omics analyses in which the number of
>> variables is thousands of times the number of samples?)
>>
>> Sensible penalization (it's usually not too sensitive to the details) is
>> only another way of obtaining a parsimonious model with good (in the sense
>> of minimizing overall prediction error: bias + variance) prediction
>> properties. Alas, this is often not what scientists want: they use variable
>> selection to find the "right" covariates, the "most important" variables
>> affecting the response. But this is beyond the power of empirical modeling
>> here: "as many variables as samples" almost guarantees that there will be
>> many different and even nonoverlapping subsets of variables that are, within
>> statistical noise, equally "optimal" predictors. That is, variable selection
>> in such circumstances is just a pretty sophisticated random number generator
>> -- ergo Frank's Draconian warnings. Penalization produces better prediction
>> engines with better properties, but it cannot overcome the "as many
>> variables as samples" problem either. Entropy rules. If what is sought is a
>> way to determine the "truly important" variables, then the study must be
>> designed to provide the information to do so. You don't get something for
>> nothing.
>>
>> Cheers,
>>
>> Bert Gunter
>> Genentech Nonclinical Biostatistics
>>
>> -----Original Message-----
>> From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org]
>> On Behalf Of Frank E Harrell Jr
>> Sent: Wednesday, September 02, 2009 9:07 PM
>> To: annie Zhang
>> Cc: r-help@r-project.org
>> Subject: Re: [R] variable selection in logistic
>>
>> annie Zhang wrote:
>> > Hi, Frank,
>> >
>> > You mean the backward and forward stepwise selection is bad? You also
>> > suggest the penalized logistic regression is the best choice? Is there
>> > any function to do it as well as selecting the best penalty?
>> >
>> > Annie
>>
>> All variable selection is bad unless it is in the context of penalization.
>> You'll need penalized logistic regression, not necessarily with
>> variable selection, for example a quadratic penalty as in a case study
>> in my book, or an L1 penalty (lasso) using other packages.
>>
>> Frank
>>
>> > On Wed, Sep 2, 2009 at 7:41 PM, Frank E Harrell Jr
>> > <f.harr...@vanderbilt.edu> wrote:
>> >
>> > David Winsemius wrote:
>> >
>> > On Sep 2, 2009, at 9:36 PM, annie Zhang wrote:
>> >
>> > Hi, R users,
>> >
>> > What may be the best function in R to do variable selection in logistic
>> > regression?
>> >
>> > PhD theses, and books by famous statisticians have been pursuing
>> > the answer to that question for decades.
>> >
>> > I have the same number of variables as the number of samples,
>> > and I want to select the best variables for prediction. Is there any
>> > function doing forward selection followed by backward elimination in
>> > stepwise logistic regression?
>> >
>> > You should probably be reading up on penalized regression
>> > methods. The stepwise procedures reporting unadjusted
>> > "significance" made available by SAS and SPSS to the unwary
>> > neophyte user have very poor statistical properties.
>> >
>> > --
>> > David Winsemius, MD
>> >
>> > Amen to that.
>> >
>> > Annie, resist the temptation. These methods bite.
>> >
>> > Frank
>> >
>> > Heritage Laboratories
>> > West Hartford, CT
>> >
>> > ______________________________________________
>> > R-help@r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/r-help
>> > PLEASE do read the posting guide
>> > http://www.R-project.org/posting-guide.html
>> > and provide commented, minimal, self-contained, reproducible code.
>> >
>> > --
>> > Frank E Harrell Jr   Professor and Chair           School of Medicine
>> >                      Department of Biostatistics   Vanderbilt University
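
Bert's remark in the thread, that with as many variables as samples variable selection behaves like "a pretty sophisticated random number generator", can be illustrated with a small simulation. What follows is only a rough sketch under stated assumptions, not code from anyone in the thread. It is written in Python with the standard library only. The data are simulated (35 samples, 35 Gaussian predictors, of which the first 3 carry signal), and the selection rule is a deliberately simple stand-in: keep the 5 predictors with the largest absolute correlation with the outcome. It is not stepwise selection, 'fastbw', 'step.plr', or the lasso. All function names, the effect size, and the seed are hypothetical choices made for illustration.

```python
import math
import random
from itertools import combinations


def simulate(n, p, n_signal, effect, rng):
    """Draw n rows of p Gaussian predictors plus a binary outcome.

    Assumption for illustration: only the first n_signal predictors
    affect the outcome, via a logistic model with coefficient `effect`.
    """
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = []
    for row in X:
        eta = effect * sum(row[:n_signal])
        prob = 1.0 / (1.0 + math.exp(-eta))
        y.append(1 if rng.random() < prob else 0)
    return X, y


def abs_correlation(x, y):
    """Absolute Pearson correlation; 0.0 if either input is constant."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return abs(sxy) / math.sqrt(sxx * syy)


def select_top_k(X, y, k):
    """Stand-in selection rule: indices of the k highest-scoring columns."""
    p = len(X[0])
    scores = [abs_correlation([row[j] for row in X], y) for j in range(p)]
    ranked = sorted(range(p), key=lambda j: scores[j], reverse=True)
    return frozenset(ranked[:k])


def bootstrap_selections(X, y, k, n_boot, rng):
    """Re-run the selection rule on n_boot bootstrap resamples of the rows."""
    n = len(X)
    picks = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        picks.append(select_top_k([X[i] for i in idx], [y[i] for i in idx], k))
    return picks


def mean_jaccard(sets):
    """Average pairwise Jaccard overlap between the selected sets."""
    pairs = list(combinations(sets, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)


rng = random.Random(2009)  # arbitrary seed, for reproducibility only
X, y = simulate(n=35, p=35, n_signal=3, effect=1.0, rng=rng)
picks = bootstrap_selections(X, y, k=5, n_boot=200, rng=rng)
print("distinct selected sets:", len(set(picks)))
print("mean pairwise Jaccard overlap:", round(mean_jaccard(picks), 3))
```

If the list of "important" variables were stable, the bootstrap resamples would keep returning nearly the same set: few distinct sets and an overlap close to 1. Rerunning with a larger n, or a larger effect, shows how much data or signal it takes before the selected set settles down, which is the point behind Frank's "50 times as many subjects" remark.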