I think the real issue is why the fit is being
done. If it is solely to interpolate and condense
the dataset, the number of variables is not an important issue.
If the goal is to develop a model that captures
causality, it is hard to believe that can be
accomplished with 50+ variables. With that many,
some kind of variable hunt would have to be done,
and the resulting model would not be very stable.
It would perhaps be better to first reduce the
variable set by, say, principal components
analysis, so that a reasonably sized set results.
If a stable and meaningful model is the goal,
each term in the final model should be plausibly causal.
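
A minimal sketch of that two-step approach in R (the data here are
made up, and keeping 5 components is an arbitrary choice, purely for
illustration):

## toy sketch: reduce many correlated predictors to a few principal
## components, then fit the logistic model on the component scores
set.seed(1)
X <- matrix(rnorm(1800 * 50), nrow = 1800)   # stand-in for the real predictors
y <- rbinom(1800, 1, plogis(0.5 * X[, 1] - 0.3 * X[, 2]))

pc  <- prcomp(X, scale. = TRUE)              # PCA on the predictors
k   <- 5                                     # number of components kept (arbitrary)
dat <- data.frame(y = y, pc$x[, 1:k])

fit <- glm(y ~ ., data = dat, family = binomial)
summary(fit)

Whether the retained components are themselves plausibly causal is of
course a separate question.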
At 10:36 AM 6/14/2010, Claudia Beleites wrote:
Dear all,
(I sent this first part of the email to John
earlier today, but forgot to copy it to the list as well)
Dear John,
> Hi, this is not an R technical question per se. I know there are many
> excellent statisticians on this list, so here are my questions: I have a
> dataset with ~1800 observations and 50 independent variables, so there
> are about 35 samples per variable. Is it wise to build a stable multiple
> logistic model with 50 independent variables? Any problem with this
> approach? Thanks
First: I'm not a statistician, but a spectroscopist.
But I do build logistic regression models with
far fewer than 1800 samples and far more variates
(e.g. 75 patients / 256 spectral measurement
channels). Though I do have many measurements per
sample: typically several hundred spectra per sample.
Question: are the 1800 real, independent samples?
Model stability is something you can measure.
Do an honest validation of your model with really
_independent_ test data, and measure the
stability according to what your stability needs
are (e.g. stable parameters or stable predictions?).
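
For example, one can refit the model on resampled training sets and
look at how much the coefficients (or the predictions) scatter; a
rough sketch with made-up data:

## rough stability check: refit on bootstrap resamples and look at
## how much the coefficients scatter
set.seed(2)
n <- 200; p <- 10
X <- matrix(rnorm(n * p), nrow = n)
dat <- data.frame(y = rbinom(n, 1, plogis(X[, 1])), X)

coefs <- replicate(200, {
  i <- sample(n, replace = TRUE)
  coef(glm(y ~ ., data = dat[i, ], family = binomial))
})
apply(coefs, 1, sd)   # spread of each parameter over the resamples

The same loop applied to predict() on a fixed test set would measure
prediction stability instead.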
(From here on, replying to Joris)
> Marc's explanation is valid to a certain extent, but I don't agree with
> his conclusion. I'd like to point out "the curse of
> dimensionality" (Hughes effect), which starts to play a role rather quickly.
No doubt.
> The curse of dimensionality is easily demonstrated by looking at the
> proximity between your datapoints. Say we scale the interval in one
> dimension to be 1 unit. If you have 20 evenly spaced observations, the
> distance between the observations is 0.05 units. To have a proximity
> like that in a 2-dimensional space, you need 20^2 = 400 observations. In
> a 10-dimensional space this becomes 20^10 ~ 10^13 datapoints. The
> distance between your observations is important, as a sparse dataset
> will definitely make your model misbehave.
But won't the distance between groups also grow?
No doubt, high-dimensional spaces are _very_ unintuitive.
However, the required sample size may grow
substantially more slowly if the model has
appropriate restrictions. I remember the
recommendation of "at least 5 samples per class
and variate" for linear classification models.
I.e. not to get a good model, but to have a
reasonable chance of getting a stable model.
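
Just to put numbers on the two arguments (the grid-density reasoning
from the quoted mail versus the 5-samples-per-class-and-variate rule
of thumb, assuming 2 classes):

## how the two sample size arguments scale with the dimension d
d <- c(1, 2, 5, 10, 50)
grid_density  <- 20^d        # keep a nearest-neighbour spacing of ~0.05
rule_of_thumb <- 5 * 2 * d   # 5 samples per class and variate, 2 classes
data.frame(d, grid_density, rule_of_thumb)

grid_density explodes exponentially, while rule_of_thumb grows only
linearly with the number of variates.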
> Even with about 35 samples per variable, using 50 independent
> variables will render a highly unstable model,
Am I wrong in thinking that there may be a
substantial difference between the stability of
predictions and the stability of model parameters?
BTW: if the models are unstable, there's also aggregation (e.g. bagging).
At least for my spectra I can give toy examples
with a physical-chemical explanation that yield
the same predictions with different parameters
(of course because of correlation).
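
A made-up toy example along those lines (two almost collinear
predictors: either one can carry the weight, so the parameters differ
while the predictions hardly change):

## two nearly collinear predictors: quite different parameter vectors
## give practically the same fitted probabilities
set.seed(3)
x1 <- rnorm(200)
x2 <- x1 + rnorm(200, sd = 0.05)
y  <- rbinom(200, 1, plogis(2 * x1))

f1 <- glm(y ~ x1, family = binomial)   # all the weight on x1
f2 <- glm(y ~ x2, family = binomial)   # all the weight on x2
coef(f1)                               # different parameters ...
coef(f2)
cor(fitted(f1), fitted(f2))            # ... nearly identical predictions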
> as your dataspace is
> about as sparse as it can get. On top of that, interpreting a model
> with 50 variables is close to impossible,
No, not necessarily. IMHO it depends very much on
the meaning of the variables. E.g. for the
spectra, a set of model parameters may be
interpreted like spectra or difference spectra.
Of course this has to do with the fact that a
parallel coordinate plot is the more "natural"
view of spectra, compared to a point in so many dimensions.
> and then I didn't even start
> on interactions. No point in trying I'd say. If you really need all
> that information, you might want to take a look at some dimension
> reduction methods first.
Which brings to mind a question I've had for a long time:
I assume that all variables that I know
beforehand to be without information are already discarded.
The dimensionality is then further reduced in a
data-driven way (e.g. by PCA or PLS). The model is built in the reduced space.
How many fewer samples are actually needed,
considering the fact that the dimension
reduction is itself a model estimated from the data?
...which of course also means that the honest
validation has to embrace the data-driven dimensionality reduction as well...
Are there recommendations about that?
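
What I mean in code (a sketch with made-up data: the PCA is
re-estimated inside every fold, so the held-out fold never influences
the dimension reduction):

## honest validation: the PCA is re-fit within every CV fold
set.seed(4)
n <- 300; p <- 50; k <- 5
X <- matrix(rnorm(n * p), nrow = n)
y <- rbinom(n, 1, plogis(X[, 1]))

folds <- sample(rep(1:10, length.out = n))
err <- sapply(1:10, function(f) {
  train <- folds != f
  pc   <- prcomp(X[train, ], scale. = TRUE)         # fitted on the training part only
  Ztr  <- data.frame(y = y[train], pc$x[, 1:k])
  Zte  <- data.frame(predict(pc, X[!train, ])[, 1:k])
  fit  <- glm(y ~ ., data = Ztr, family = binomial)
  pred <- predict(fit, newdata = Zte, type = "response")
  mean((pred > 0.5) != y[!train])                   # misclassification rate
})
mean(err)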
The other curious question I have is:
I assume that it is impossible for him to obtain
the 10^xy samples required for comfortable model building.
So what is he to do?
Cheers,
Claudia
--
Claudia Beleites
Dipartimento dei Materiali e delle Risorse Naturali
Università degli Studi di Trieste
Via Alfonso Valerio 6/a
I-34127 Trieste
phone: +39 0 40 5 58-37 68
email: cbelei...@units.it
================================================================
Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: r...@lcfltd.com
Least Cost Formulations, Ltd. URL: http://lcfltd.com/
824 Timberlake Drive Tel: 757-467-0954
Virginia Beach, VA 23464-3239 Fax: 757-467-2947
"Vere scire est per causas scire"
================================================================
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.