Your original question cannot be answered until
you specify your goal.

Is it to develop a model with external validity
that will generalize to new data? (You are not
likely to succeed if you start with a
"boil the ocean" approach with 44,000+ covariates
and millions of records.) This is the point Prof. Harrell is making.
Or is it to reduce a large dataset to a tractable
predictor formula that only interpolates your dataset?

If the former, you will need external modeling
information to separate the wheat from the chaff
in your excessive predictor set.

Assuming it is the latter, then almost any
approach that ends up with a tractable model
(one that has no meaning other than interpolation
of this specific dataset) will be useful. For this,
regression trees or even stepwise regression
would work. The algorithm must be very simple and
computationally efficient. This is the domain of data mining.
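
As a rough illustration of the regression-tree route, here is a minimal
sketch on simulated data. The data frame `dat`, the variable names
`v1`..`v50`, the outcome `y`, and the control settings are all
assumptions made up for the example, not anything from the original
poster's data. It only shows how to read off which variables a fitted
tree actually used.

```r
library(rpart)  # recommended package shipped with standard R installs

set.seed(1)
n <- 2000; p <- 50                  # small stand-in for the real problem
x <- as.data.frame(matrix(rnorm(n * p), n, p))
names(x) <- paste0("v", seq_len(p))

# Hypothetical binary outcome driven by v1 and v2 only
y <- factor(rbinom(n, 1, plogis(1.5 * x$v1 - 1.5 * x$v2)))
dat <- data.frame(y = y, x)

# Classification tree on all covariates; cp and minsplit are arbitrary here
fit <- rpart(y ~ ., data = dat, method = "class",
             control = rpart.control(cp = 0.01, minsplit = 50))

# Variables that appear in at least one split
used <- setdiff(unique(as.character(fit$frame$var)), "<leaf>")
print(used)
```

The resulting variable list describes this dataset only; it should not
be read as evidence about which covariates matter in new data.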

I would suggest you start by looking at covariate
patterns to find out where the scarcity lies.
Records with rare patterns will end up as
high-leverage data.
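
One possible way to look at covariate patterns, sketched on a toy
example: the three binary covariates `a`, `b`, `c` and the outcome `y`
are invented for illustration, and with thousands of covariates you
would have to do this on subsets. The sketch counts how many records
share each pattern and compares that with the average leverage (hat
value) from a logistic fit.

```r
set.seed(2)
n <- 1000
dat <- data.frame(
  a = rbinom(n, 1, 0.5),
  b = rbinom(n, 1, 0.05),   # deliberately rare level
  c = rbinom(n, 1, 0.3)
)
dat$y <- rbinom(n, 1, plogis(-1 + dat$a + dat$b))

# Number of records sharing each observed covariate pattern, rarest first
pattern <- interaction(dat$a, dat$b, dat$c, drop = TRUE)
counts <- sort(table(pattern))
print(counts)

# Leverage of each record in a main-effects logistic model
fit <- glm(y ~ a + b + c, data = dat, family = binomial)
lev <- hatvalues(fit)

# Average leverage within each covariate pattern
print(round(tapply(lev, pattern, mean), 4))
```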

Another place to start is common sense: thousands
of covariates cannot all contain independent
information of value. Try to cluster them and
pick the best representative from each cluster
based on expert knowledge. You may solve your problem quickly that way.
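
A minimal sketch of the clustering idea, assuming numeric covariates
and using 1 - |correlation| as the distance. The simulated matrix `x`,
the choice of three clusters, and the automatic "most central variable"
rule are all assumptions for illustration; the automatic rule is only a
stand-in for the expert choice suggested above. Note that a full
correlation matrix for 44,000 covariates would be very large, so in
practice this would have to be done in blocks.

```r
set.seed(3)
n <- 500
z <- matrix(rnorm(n * 3), n, 3)     # three hypothetical latent factors

# Twelve covariates: four noisy copies of each latent factor
x <- cbind(z[, 1] + matrix(rnorm(n * 4, sd = 0.3), n, 4),
           z[, 2] + matrix(rnorm(n * 4, sd = 0.3), n, 4),
           z[, 3] + matrix(rnorm(n * 4, sd = 0.3), n, 4))
colnames(x) <- paste0("v", 1:12)

# Hierarchical clustering of variables on 1 - |correlation|
r <- abs(cor(x))
hc <- hclust(as.dist(1 - r), method = "average")
grp <- cutree(hc, k = 3)            # k = 3 is known here only by construction

# Placeholder for expert judgment: in each cluster take the variable
# with the highest mean |correlation| with the cluster's members
reps <- sapply(split(names(grp), grp), function(vars) {
  vars[which.max(rowMeans(r[vars, vars, drop = FALSE]))]
})
print(reps)
```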

At 05:34 AM 10/1/2008, Bernardo Rangel Tura wrote:
On Tue, 2008-09-30 at 18:56 -0500, Frank E Harrell Jr wrote:
> Bernardo Rangel Tura wrote:
> > On Sat, 2008-09-27 at 10:51 -0700, milicic.marko wrote:
> >> I have a huge data set with thousands of variables and one binary
> >> variable. I know that most of the variables are correlated and are not
> >> good predictors... but...
> >>
> >> It is very hard to start modeling with such a huge dataset. What would
> >> be your suggestion? How to make a first cut... how to eliminate most
> >> of the variables but not to ignore potential interactions... for
> >> example, maybe variable A is not a good predictor and variable B is not
> >> a good predictor either, but maybe A and B together are a good
> >> predictor...
> >>
> >> Any suggestion is welcomed
> >
> >
> > milicic.marko
> >
> > I think you could start with rpart("binary variable"~.)
> > This shows you a set of variables to start a model and a starting set
> > of cutoffs for continuous variables
>
> I cannot imagine a worse way to formulate a regression model. Reasons
> include
>
> 1. Results of recursive partitioning are not trustworthy unless the
> sample size exceeds 50,000 or the signal to noise ratio is extremely
> high.
>
> 2. The type I error of tests from the final regression model will be
> extraordinarily inflated.
>
> 3. False interactions will appear in the model.
>
> 4. The cutoffs so chosen will not replicate and in effect assume that
> covariate effects are discontinuous and piecewise flat. The use of
> cutoffs results in a huge loss of information and power and makes the
> analysis arbitrary and impossible to interpret (e.g., a high covariate
> value:low covariate value odds ratio or mean difference is a complex
> function of all the covariate values in the sample).
>
> 5. The model will not validate in new data.

Professor Frank,

Thank you for your explanation. Well, if my first idea is wrong, what is
your opinion on the following approach?

1- Run PCA on the data, excluding the binary variable
2- Put the principal components in a logistic model
3- Afterwards, convert the principal components back to the original
   variables (only if this is of interest to milicic.marko)

If this approach is wrong too, what is your approach?

--
Bernardo Rangel Tura, M.D,MPH,Ph.D
National Institute of Cardiology
Brazil
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html and
provide commented, minimal, self-contained, reproducible code.
================================================================
Robert A. LaBudde, PhD, PAS, Dpl. ACAFS e-mail: [EMAIL PROTECTED]
Least Cost Formulations, Ltd. URL: http://lcfltd.com/
824 Timberlake Drive Tel: 757-467-0954
Virginia Beach, VA 23464-3239 Fax: 757-467-2947
"Vere scire est per causas scire"
================================================================