I think the real issue is why the fit is being
done. If it is solely to interpolate and condense
the dataset, the number of variables is not an important issue.
If the goal is to develop a model that captures
causality, it is hard to believe that can be
accomplished with 50+ variables. With that many,
some kind of variable hunt would have to be done,
and the resulting model would not be very stable.
It would perhaps be better to first reduce the
variable set by, say, principal components
analysis, so that a reasonably sized set results.
If a stable and meaningful model is the goal,
each term in the final model should be plausibly causal.
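
A minimal sketch of that two-step approach in R (the data here are
made up, and keeping 5 components is an arbitrary choice, purely for
illustration):

## toy sketch: reduce many correlated predictors to a few principal
## components, then fit the logistic model on the component scores
set.seed(1)
X <- matrix(rnorm(1800 * 50), nrow = 1800)   # stand-in for the real predictors
y <- rbinom(1800, 1, plogis(0.5 * X[, 1] - 0.3 * X[, 2]))

pc  <- prcomp(X, scale. = TRUE)              # PCA on the predictors
k   <- 5                                     # number of components kept (arbitrary)
dat <- data.frame(y = y, pc$x[, 1:k])

fit <- glm(y ~ ., data = dat, family = binomial)
summary(fit)

Whether the retained components are themselves plausibly causal is of
course a separate question.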
At 10:36 AM 6/14/2010, Claudia Beleites wrote:
Dear all,
(I sent this first part of the email to John
earlier today, but forgot to copy it to the list as well)
Dear John,
> Hi, this is not an R technical question per se. I know there are many
> excellent statisticians on this list, so here are my questions: I have a
> dataset with ~1800 observations and 50 independent variables, so there
> are about 35 samples per variable. Is it wise to build a stable multiple
> logistic model with 50 independent variables? Any problem with this
> approach? Thanks
First: I'm not a statistician, but a spectroscopist.
But I do build logistic regression models with
far fewer than 1800 samples and far more variates
(e.g. 75 patients / 256 spectral measurement
channels). Though I do have many measurements per
sample: typically several hundred spectra per sample.
Question: are the 1800 real, independent samples?
Model stability is something you can measure.
Do an honest validation of your model with really
_independent_ test data, and measure the
stability according to what your stability needs
are (e.g. stable parameters or stable predictions?).
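
For example, one can refit the model on resampled training sets and
look at how much the coefficients (or the predictions) scatter; a
rough sketch with made-up data:

## rough stability check: refit on bootstrap resamples and look at
## how much the coefficients scatter
set.seed(2)
n <- 200; p <- 10
X <- matrix(rnorm(n * p), nrow = n)
dat <- data.frame(y = rbinom(n, 1, plogis(X[, 1])), X)

coefs <- replicate(200, {
  i <- sample(n, replace = TRUE)
  coef(glm(y ~ ., data = dat[i, ], family = binomial))
})
apply(coefs, 1, sd)   # spread of each parameter over the resamples

The same loop applied to predict() on a fixed test set would measure
prediction stability instead.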
(From here on, replying to Joris)
> Marc's explanation is valid to a certain extent, but I don't agree with
> his conclusion. I'd like to point out "the curse of
> dimensionality" (Hughes effect), which starts to play a role rather quickly.
No doubt.
> The curse of dimensionality is easily demonstrated by looking at the
> proximity between your datapoints. Say we scale the interval in one
> dimension to be 1 unit. If you have 20 evenly spaced observations, the
> distance between the observations is 0.05 units. To have a proximity
> like that in a 2-dimensional space, you need 20^2 = 400 observations. In
> a 10-dimensional space this becomes 20^10 ~ 10^13 datapoints. The
> distance between your observations is important, as a sparse dataset
> will definitely make your model misbehave.
But won't the distance between groups also grow?
No doubt, high-dimensional spaces are _very_ unintuitive.
However, the required sample size may grow
substantially more slowly if the model has
appropriate restrictions. I remember the
recommendation of "at least 5 samples per class
and variate" for linear classification models.
I.e. not to get a good model, but to have a
reasonable chance of getting a stable model.
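
Just to put numbers on the two arguments (the grid-density reasoning
from the quoted mail versus the 5-samples-per-class-and-variate rule
of thumb, assuming 2 classes):

## how the two sample size arguments scale with the dimension d
d <- c(1, 2, 5, 10, 50)
grid_density  <- 20^d        # keep a nearest-neighbour spacing of ~0.05
rule_of_thumb <- 5 * 2 * d   # 5 samples per class and variate, 2 classes
data.frame(d, grid_density, rule_of_thumb)

grid_density explodes exponentially, while rule_of_thumb grows only
linearly with the number of variates.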
> Even with about 35 samples per variable, using 50 independent
> variables will render a highly unstable model,
Am I wrong in thinking that there may be a
substantial difference between the stability of
predictions and the stability of model parameters?
BTW: if the models are unstable, there's also aggregation (e.g. bagging).
At least for my spectra I can give toy examples
with a physical-chemical explanation that yield
the same predictions with different parameters
(of course because of correlation).
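
A made-up toy example along those lines (two almost collinear
predictors: either one can carry the weight, so the parameters differ
while the predictions hardly change):

## two nearly collinear predictors: quite different parameter vectors
## give practically the same fitted probabilities
set.seed(3)
x1 <- rnorm(200)
x2 <- x1 + rnorm(200, sd = 0.05)
y  <- rbinom(200, 1, plogis(2 * x1))

f1 <- glm(y ~ x1, family = binomial)   # all the weight on x1
f2 <- glm(y ~ x2, family = binomial)   # all the weight on x2
coef(f1)                               # different parameters ...
coef(f2)
cor(fitted(f1), fitted(f2))            # ... nearly identical predictions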
> as your dataspace is
> about as sparse as it can get. On top of that, interpreting a model
> with 50 variables is close to impossible,
No, not necessarily. IMHO it depends very much on
the meaning of the variables. E.g. for the
spectra, a set of model parameters may be
interpreted like spectra or difference spectra.
Of course this has to do with the fact that a
parallel coordinate plot is the more "natural"
view of spectra, compared to a point in so many dimensions.
> and then I didn't even start
> on interactions. No point in trying I'd say. If you really need all
> that information, you might want to take a look at some dimension
> reduction methods first.
Which brings to mind a question I've had for a long time:
I assume that all variables that I know
beforehand to be without information are already discarded.
The dimensionality is then further reduced in a
data-driven way (e.g. by PCA or PLS). The model is built in the reduced space.
How many fewer samples are actually needed,
considering the fact that the dimension
reduction is itself a model estimated from the data?
...which of course also means that the honest
validation has to embrace the data-driven dimensionality reduction as well...
Are there recommendations about that?
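
What I mean in code (a sketch with made-up data: the PCA is
re-estimated inside every fold, so the held-out fold never influences
the dimension reduction):

## honest validation: the PCA is re-fit within every CV fold
set.seed(4)
n <- 300; p <- 50; k <- 5
X <- matrix(rnorm(n * p), nrow = n)
y <- rbinom(n, 1, plogis(X[, 1]))

folds <- sample(rep(1:10, length.out = n))
err <- sapply(1:10, function(f) {
  train <- folds != f
  pc   <- prcomp(X[train, ], scale. = TRUE)         # fitted on the training part only
  Ztr  <- data.frame(y = y[train], pc$x[, 1:k])
  Zte  <- data.frame(predict(pc, X[!train, ])[, 1:k])
  fit  <- glm(y ~ ., data = Ztr, family = binomial)
  pred <- predict(fit, newdata = Zte, type = "response")
  mean((pred > 0.5) != y[!train])                   # misclassification rate
})
mean(err)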
The other curious question I have is:
I assume that it is impossible for him to obtain
the 10^xy samples required for comfortable model building.
So what is he to do?
Cheers,
Claudia
--
Claudia Beleites
Dipartimento dei Materiali e delle Risorse Naturali
Università degli Studi di Trieste
Via Alfonso Valerio 6/a
I-34127 Trieste
phone: +39 0 40 5 58-37 68
email: cbelei...@units.it
================================================================
Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com
Least Cost Formulations, Ltd. URL: http://lcfltd.com/
824 Timberlake Drive Tel: 757-467-0954
Virginia Beach, VA 23464-3239 Fax: 757-467-2947
"Vere scire est per causas scire"
================================================================
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.