Re: [R] variable selection in logistic

Frank E Harrell Jr Thu, 03 Sep 2009 10:14:44 -0700

annie Zhang wrote:

Thank you for all your replies.
Actually, as Bert said, besides prediction I also need variable selection (I need to know which variables are important). As for the sample size and the number of variables, both are small, around 35. How can I get accurate prediction as well as good predictors?
Annie

It is next to impossible to find a unique list of 'important' variables without having 50 times as many subjects as potential predictors, unless your signal:noise ratio is stunning.


Frank
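
To make the instability concrete, here is a minimal simulation sketch in R (it is not from the thread). It assumes 35 subjects and 35 candidate predictors, of which only x1 to x3 truly matter; the data, the effect sizes, and the helper top_k are all made up for illustration. Ranking by absolute correlation with the outcome is used as a simple stand-in for a selection procedure; it is not stepwise regression itself. Each bootstrap resample keeps its five top-ranked variables, and the final table reports how often each variable made the cut.

```r
set.seed(1)
n <- 35   # subjects (matches the thread's "around 35")
p <- 35   # candidate predictors

# Simulated data: independent standard-normal predictors (an assumption).
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))

# Assumed truth for illustration: only x1..x3 affect the binary outcome.
y <- rbinom(n, 1, plogis(drop(X[, 1:3] %*% c(1, 1, 1))))

# Hypothetical helper: names of the k predictors with the largest
# absolute correlation with the outcome.
top_k <- function(X, y, k = 5) {
  score <- abs(cor(X, y))[, 1]
  names(sort(score, decreasing = TRUE))[1:k]
}

# Repeat the "selection" on bootstrap resamples of the same 35 subjects.
B <- 200
picked <- replicate(B, {
  idx <- sample(n, replace = TRUE)
  top_k(X[idx, , drop = FALSE], y[idx])
})

# Proportion of resamples in which each variable was selected.
freq <- sort(table(picked), decreasing = TRUE) / B
print(head(freq, 10))
```

If selection were stable, five variables would each show a frequency near 1 and every other variable a frequency near 0; the further the table is from that pattern, the less a single selected list can be trusted.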

On Thu, Sep 3, 2009 at 8:28 AM, Bert Gunter <gunter.ber...@gene.com> wrote:


    But let's be clear here folks:

    Ben's comment is apropos: "'As many variables as samples' is particularly
    scary."

    (Aside -- how much scarier then are -omics analyses in which the number of
    variables is thousands of times the number of samples?)

    Sensible penalization (it's usually not too sensitive to the details) is
    only another way of obtaining a parsimonious model with good (in the sense
    of minimizing overall prediction error: bias + variance) prediction
    properties. Alas, this is often not what scientists want: they use variable
    selection to find the "right" covariates, the "most important" variables
    affecting the response. But this is beyond the power of empirical modeling
    here: "as many variables as samples" almost guarantees that there will be
    many different and even nonoverlapping subsets of variables that are,
    within statistical noise, equally "optimal" predictors. That is, variable
    selection in such circumstances is just a pretty sophisticated random
    number generator -- ergo Frank's Draconian warnings. Penalization produces
    better prediction engines with better properties, but it cannot overcome
    the "as many variables as samples" problem either. Entropy rules. If what
    is sought is a way to determine the "truly important" variables, then the
    study must be designed to provide the information to do so. You don't get
    something for nothing.

    Cheers,

    Bert Gunter
    Genentech Nonclinical Biostatistics


    -----Original Message-----
    From: r-help-boun...@r-project.org
    [mailto:r-help-boun...@r-project.org] On Behalf Of Frank E Harrell Jr
    Sent: Wednesday, September 02, 2009 9:07 PM
    To: annie Zhang
    Cc: r-help@r-project.org
    Subject: Re: [R] variable selection in logistic

    annie Zhang wrote:
     > Hi, Frank,
     >
     > You mean the backward and forward stepwise selection is bad? You also
     > suggest the penalized logistic regression is the best choice? Is there
     > any function to do it as well as selecting the best penalty?
     >
     > Annie

    All variable selection is bad unless it's in the context of penalization.
    You'll need penalized logistic regression, not necessarily with variable
    selection: for example, a quadratic penalty as in a case study in my book,
    or an L1 penalty (lasso) using other packages.

    Frank
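
As a rough illustration of what a quadratic (ridge) penalty does, the sketch below fits a penalized logistic regression in base R and picks the penalty by 5-fold cross-validated deviance. It is not the code from the book's case study, and the simulated data, the penalty grid, and the helpers ridge_logistic and cv_deviance are all assumptions made up for this example. In practice a packaged implementation (for example cv.glmnet in the glmnet package) would normally be preferred.

```r
# Ridge-penalized logistic regression by Newton-Raphson.
# Maximizes loglik(beta) - (lambda / 2) * sum(beta[-1]^2);
# the intercept is left unpenalized.
ridge_logistic <- function(X, y, lambda, max_iter = 50, tol = 1e-8) {
  X1 <- cbind(1, X)
  pen <- diag(c(0, rep(lambda, ncol(X))))
  beta <- rep(0, ncol(X1))
  for (iter in seq_len(max_iter)) {
    mu <- plogis(drop(X1 %*% beta))
    grad <- crossprod(X1, y - mu) - pen %*% beta        # penalized score
    hess <- crossprod(X1, X1 * (mu * (1 - mu))) + pen   # penalized information
    step <- drop(solve(hess, grad))
    beta <- beta + step
    if (max(abs(step)) < tol) break
  }
  beta
}

# Mean out-of-fold binomial deviance for one value of lambda.
cv_deviance <- function(X, y, lambda, folds) {
  dev <- 0
  for (k in unique(folds)) {
    test <- folds == k
    beta <- ridge_logistic(X[!test, , drop = FALSE], y[!test], lambda)
    prob <- plogis(drop(cbind(1, X[test, , drop = FALSE]) %*% beta))
    dev <- dev - 2 * sum(dbinom(y[test], 1, prob, log = TRUE))
  }
  dev / length(y)
}

# Simulated stand-in for the poster's data: 35 subjects, 35 predictors.
set.seed(1)
n <- 35
p <- 35
X <- scale(matrix(rnorm(n * p), n, p))   # standardize so one lambda fits all
y <- rbinom(n, 1, plogis(X[, 1] + X[, 2]))

folds <- sample(rep(1:5, length.out = n))
lambdas <- c(1, 3, 10, 30, 100)          # arbitrary illustrative grid
cv <- sapply(lambdas, function(l) cv_deviance(X, y, l, folds))
best <- lambdas[which.min(cv)]

fit <- ridge_logistic(X, y, best)        # intercept first, then 35 slopes
print(data.frame(lambda = lambdas, cv_deviance = cv))
```

Note that the ridge fit keeps every predictor and only shrinks the coefficients; it yields a prediction model, not a list of selected variables. An L1 (lasso) penalty would set some coefficients exactly to zero, but the zeroed set is subject to the same instability discussed elsewhere in this thread.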

     >
     > On Wed, Sep 2, 2009 at 7:41 PM, Frank E Harrell Jr
     > <f.harr...@vanderbilt.edu> wrote:
     >
     >     David Winsemius wrote:
     >
     >
     >         On Sep 2, 2009, at 9:36 PM, annie Zhang wrote:
     >
     >             Hi, R users,
     >
     >             What may be the best function in R to do variable selection
     >             in logistic regression?
     >
     >
     >         PhD theses and books by famous statisticians have been pursuing
     >         the answer to that question for decades.
     >
     >             I have the same number of variables as the number of samples,
     >             and I want to select the best variables for prediction. Is
     >             there any function doing forward selection followed by
     >             backward elimination in stepwise logistic regression?
     >
     >
     >         You should probably be reading up on penalized regression
     >         methods. The stepwise procedures reporting unadjusted
     >         "significance" made available by SAS and SPSS to the unwary
     >         neophyte user have very poor statistical properties.
     >
     >         --
     >
     >         David Winsemius, MD
     >
     >
     >     Amen to that.
     >
     >     Annie, resist the temptation.  These methods bite.
     >
     >     Frank
     >
     >
     >         Heritage Laboratories
     >         West Hartford, CT
     >
     >         ______________________________________________
     >         R-help@r-project.org mailing list
     >         https://stat.ethz.ch/mailman/listinfo/r-help
     >         PLEASE do read the posting guide
     >         http://www.R-project.org/posting-guide.html
     >         and provide commented, minimal, self-contained, reproducible code.
     >
     >
     >
     >     --
     >     Frank E Harrell Jr   Professor and Chair           School of Medicine
     >                         Department of Biostatistics   Vanderbilt University
     >
     >

--
Frank E Harrell Jr   Professor and Chair           School of Medicine
                     Department of Biostatistics   Vanderbilt University