Re: [R] variable selection in logistic

annie Zhang Thu, 03 Sep 2009 13:22:55 -0700

Hi, Frank,

If I want to do prediction as well as to select important predictors, which
may be the best function to use when I have 35 samples and 35 predictors
(penalized logistic with variable selection)? I saw there is a 'fastbw'
function in the Design package. And there is a 'step.plr' function in the
'stepPlr' package.


Thank you,

Annie

On Thu, Sep 3, 2009 at 10:11 AM, Frank E Harrell Jr
<f.harr...@vanderbilt.edu> wrote:

> annie Zhang wrote:
>
>> Thank you for all your replies.
>> Actually, as Bert said, besides prediction I also need variable selection
>> (I need to know which variables are important). As for the sample size
>> and the number of variables, both are small, around 35. How can I get
>> accurate prediction as well as good predictors?
>> Annie
>>
>
> It is next to impossible to find a unique list of 'important' variables
> without having 50 times as many subjects as potential predictors, unless
> your signal:noise ratio is stunning.
>
> Frank
>
>
>> On Thu, Sep 3, 2009 at 8:28 AM, Bert Gunter <gunter.ber...@gene.com> wrote:
>>
>>    But let's be clear here folks:
>>
>>    Ben's comment is apropos: "'As many variables as samples' is
>>    particularly scary."
>>
>>    (Aside -- how much scarier then are -omics analyses in which the
>>    number of variables is thousands of times the number of samples?)
>>
>>    Sensible penalization (it's usually not too sensitive to the
>>    details) is only another way of obtaining a parsimonious model with
>>    good (in the sense of minimizing overall prediction error: bias +
>>    variance) prediction properties. Alas, this is often not what
>>    scientists want: they use variable selection to find the "right"
>>    covariates, the "most important" variables affecting the response.
>>    But this is beyond the power of empirical modeling here: "as many
>>    variables as samples" almost guarantees that there will be many
>>    different and even nonoverlapping subsets of variables that are,
>>    within statistical noise, equally "optimal" predictors. That is,
>>    variable selection in such circumstances is just a pretty
>>    sophisticated random number generator -- ergo Frank's Draconian
>>    warnings. Penalization produces better prediction engines with
>>    better properties, but it cannot overcome the "as many variables as
>>    samples" problem either. Entropy rules. If what is sought is a way
>>    to determine the "truly important" variables, then the study must
>>    be designed to provide the information to do so. You don't get
>>    something for nothing.
>>
>>    Cheers,
>>
>>    Bert Gunter
>>    Genentech Nonclinical Biostatistics
>>
>>
>>    -----Original Message-----
>>    From: r-help-boun...@r-project.org
>>    [mailto:r-help-boun...@r-project.org] On Behalf Of Frank E Harrell Jr
>>    Sent: Wednesday, September 02, 2009 9:07 PM
>>    To: annie Zhang
>>    Cc: r-help@r-project.org
>>    Subject: Re: [R] variable selection in logistic
>>
>>    annie Zhang wrote:
>>     > Hi, Frank,
>>     >
>>     > You mean the backward and forward stepwise selection is bad? You
>>     > also suggest the penalized logistic regression is the best choice?
>>     > Is there any function to do it as well as selecting the best
>>     > penalty?
>>     >
>>     > Annie
>>
>>    All variable selection is bad unless it's in the context of
>>    penalization. You'll need penalized logistic regression, not
>>    necessarily with variable selection: for example, a quadratic penalty
>>    as in a case study in my book, or an L1 penalty (lasso) using other
>>    packages.
>>
>>    Frank
>>
>>     >
>>     > On Wed, Sep 2, 2009 at 7:41 PM, Frank E Harrell Jr
>>     > <f.harr...@vanderbilt.edu> wrote:
>>     >
>>     >     David Winsemius wrote:
>>     >
>>     >
>>     >         On Sep 2, 2009, at 9:36 PM, annie Zhang wrote:
>>     >
>>     >             Hi, R users,
>>     >
>>     >             What may be the best function in R to do variable
>>     >             selection in logistic regression?
>>     >
>>     >
>>     >         PhD theses, and books by famous statisticians have been
>>     >         pursuing the answer to that question for decades.
>>     >
>>     >             I have the same number of variables as the number of
>>     >             samples, and I want to select the best variables for
>>     >             prediction. Is there any function doing forward
>>     >             selection followed by backward elimination in stepwise
>>     >             logistic regression?
>>     >
>>     >
>>     >         You should probably be reading up on penalized regression
>>     >         methods. The stepwise procedures reporting unadjusted
>>     >         "significance" made available by SAS and SPSS to the unwary
>>     >         neophyte user have very poor statistical properties.
>>     >
>>     >         --
>>     >
>>     >         David Winsemius, MD
>>     >
>>     >
>>     >     Amen to that.
>>     >
>>     >     Annie, resist the temptation.  These methods bite.
>>     >
>>     >     Frank
>>     >
>>     >
>>     >         Heritage Laboratories
>>     >         West Hartford, CT
>>     >
>>     >         ______________________________________________
>>     >         R-help@r-project.org mailing list
>>     >         https://stat.ethz.ch/mailman/listinfo/r-help
>>     >         PLEASE do read the posting guide
>>     >         http://www.R-project.org/posting-guide.html
>>     >         and provide commented, minimal, self-contained,
>>     >         reproducible code.
>>     >
>>     >
>>     >
>>     >     --
>>     >     Frank E Harrell Jr   Professor and Chair           School of Medicine
>>     >                         Department of Biostatistics   Vanderbilt University
>>     >
>>     >
>>
>>
>>    --
>>    Frank E Harrell Jr   Professor and Chair           School of Medicine
>>                         Department of Biostatistics   Vanderbilt University
>>
>>
>>
>>
>
> --
> Frank E Harrell Jr   Professor and Chair           School of Medicine
>                     Department of Biostatistics   Vanderbilt University
>
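
Bert's point in the quoted message, that with as many variables as samples
the "selected" list behaves much like a random draw, can be checked by
simulation. What follows is only a minimal sketch in base R and was not
posted by anyone in the thread. It assumes simulated data (n = p = 35, with
a modest signal in the first three predictors only) and uses a hypothetical
helper, top_k_screen, which keeps the k predictors most correlated with the
outcome as a crude stand-in for any selection procedure. The names n_boot
and k and the choice of 200 bootstrap resamples are likewise illustrative.

```r
# Sketch only: simulated data, illustrative names (top_k_screen, n_boot, k).
set.seed(1)
n <- 35; p <- 35; k <- 5; n_boot <- 200

# Assumption: only x1..x3 carry signal; the other 32 predictors are noise.
x <- matrix(rnorm(n * p), nrow = n,
            dimnames = list(NULL, paste0("x", seq_len(p))))
y <- rbinom(n, size = 1, prob = plogis(x[, 1] + x[, 2] + x[, 3]))

# Crude stand-in for a selection procedure: keep the k predictors whose
# absolute correlation with the outcome is largest.
top_k_screen <- function(x, y, k) {
  score <- abs(cor(x, y))[, 1]
  names(sort(score, decreasing = TRUE))[seq_len(k)]
}

# Repeat the selection on bootstrap resamples of the same 35 subjects.
picks <- replicate(n_boot, {
  idx <- sample.int(n, replace = TRUE)
  top_k_screen(x[idx, , drop = FALSE], y[idx], k)
})

# Share of resamples in which each predictor was selected.
sel_freq <- sort(table(picks) / n_boot, decreasing = TRUE)
print(head(sel_freq, 10))

# Number of predictors shared by each pair of selected lists (0 to k).
overlap <- combn(n_boot, 2, function(ij)
  length(intersect(picks[, ij[1]], picks[, ij[2]])))
print(mean(overlap))
```

If many predictors outside x1..x3 are selected with non-trivial frequency,
and the mean overlap between two selected lists is well below k, then the
list depends heavily on which subjects happened to be sampled. The same
resampling loop could wrap any other selection method in place of
top_k_screen.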
