Re: [R] Logistic regression problem

Frank E Harrell Jr Tue, 30 Sep 2008 20:19:25 -0700

Bernardo Rangel Tura wrote:

On Sat, 2008-09-27 at 10:51 -0700, milicic.marko wrote:

I have a huge data set with thousands of variables and one binary
variable. I know that most of the variables are correlated and are not
good predictors... but...


It is very hard to start modeling with such a huge dataset. What would
be your suggestion? How to make a first cut... how to eliminate most
of the variables without ignoring potential interactions... for
example, maybe variable A is not a good predictor and variable B is not
a good predictor either, but maybe A and B together are good
predictors...

Any suggestion is welcome.



milicic.marko

I think you could start with rpart("binary variable" ~ .)
This shows you a set of variables to start a model with, and a starting
set of cutoffs for continuous variables.

I cannot imagine a worse way to formulate a regression model. Reasons include

1. Results of recursive partitioning are not trustworthy unless the sample size exceeds 50,000 or the signal to noise ratio is extremely high.

2. The type I error of tests from the final regression model will be extraordinarily inflated.


3. False interactions will appear in the model.

4. The cutoffs so chosen will not replicate and in effect assume that covariate effects are discontinuous and piecewise flat. The use of cutoffs results in a huge loss of information and power and makes the analysis arbitrary and impossible to interpret (e.g., a high covariate value:low covariate value odds ratio or mean difference is a complex function of all the covariate values in the sample).


5. The model will not validate in new data.
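
To make point 4 concrete, here is a small simulation sketch in base R. It is only an illustration under stated assumptions, not an analysis from this thread: the sample size, the true slope of 0.5, the median as the cutoff, and the helper names lr_chisq and one_run are all made up for the example. It compares the likelihood ratio chi-square of a logistic model that uses a covariate as a continuous variable with one that uses the same covariate split at its median.

```r
# Sketch only: all settings below are illustrative assumptions.
set.seed(1)  # arbitrary seed, so the run is repeatable

# Likelihood ratio chi-square of a fitted glm: null deviance minus residual deviance
lr_chisq <- function(fit) fit$null.deviance - fit$deviance

# One simulated data set: the true effect of x on the log odds is linear
one_run <- function(n = 200, beta = 0.5) {
  x <- rnorm(n)
  y <- rbinom(n, 1, plogis(beta * x))
  x_cut <- as.integer(x > median(x))  # dichotomize at the sample median
  c(continuous   = lr_chisq(glm(y ~ x,     family = binomial)),
    dichotomized = lr_chisq(glm(y ~ x_cut, family = binomial)))
}

# Repeat the experiment and average the evidence each coding provides
res <- replicate(200, one_run())   # 2 x 200 matrix, rows named as above
mean_lr <- rowMeans(res)
print(mean_lr)
```

Under these assumptions the continuous coding should give a clearly larger average chi-square than the median split, which is the loss of information and power described above. A cutoff picked by a tree from the same data would also be chosen to look good in that sample, so this sketch, which fixes the cutoff at the median, does not show the additional problems in points 2 and 5.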

Frank
--
Frank E Harrell Jr   Professor and Chair           School of Medicine
                     Department of Biostatistics   Vanderbilt University

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
