Re: [R] Logistic regression problem

Robert A LaBudde Wed, 01 Oct 2008 09:08:33 -0700

It would not be possible to answer your original question until you specify your goal.

Is it to develop a model with external validity that will generalize to new data? (You are not likely to succeed if you are starting with a "boil the ocean" approach with 44,000+ covariates and millions of records.) This is the point Prof. Harrell is making.

Or is it to reduce a large dataset to a tractable predictor formula that only interpolates your dataset?

If the former, you will need external modeling information to select the "wheat from the chaff" in your excessive predictor set.

Assuming it is the latter, then almost any approach that ends up with a tractable model (that has no meaning other than interpolation of this specific dataset) will be useful. For this, regression trees or even stepwise regression would work. The algorithm must be very simple and computationally efficient. This is the area of data mining approaches.

I would suggest you start by looking at covariate patterns to find out where the scarcity lies. The sparse patterns will end up as high-leverage data points.
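
A minimal sketch of what looking at covariate patterns could mean in practice, using base R only. The data, the column names x1..x3, and the "5 records or fewer" threshold are all made up for illustration; the hat values come from a linear main-effects design and are only a rough proxy for leverage in a logistic fit.

```r
# Hypothetical toy data: three binary covariates (names are illustrative only).
set.seed(1)
n <- 200
x <- data.frame(
  x1 = rbinom(n, 1, 0.5),
  x2 = rbinom(n, 1, 0.2),
  x3 = rbinom(n, 1, 0.1)
)

# Collapse each record into one pattern label, then count records per pattern.
pattern <- do.call(paste, c(x, sep = "|"))
pattern_counts <- sort(table(pattern))

# Patterns seen only a few times mark the sparse regions of covariate space.
sparse_patterns <- pattern_counts[pattern_counts <= 5]   # arbitrary threshold
print(sparse_patterns)

# Hat values (diagonal of the hat matrix) for a main-effects design matrix.
X <- model.matrix(~ ., data = x)
h <- rowSums(qr.Q(qr(X))^2)

# Average leverage per pattern, listed next to how often the pattern occurs.
print(cbind(count = as.vector(table(pattern)), mean_hat = tapply(h, pattern, mean)))
```

With thousands of covariates the number of distinct patterns explodes, so a tabulation like this is only practical on a small, pre-selected subset of columns.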

Another place to start is common sense: Thousands of covariates cannot all contain independent information of value. Try to cluster them and pick the best representative from each cluster based on expert knowledge. You may solve your problem quickly that way.
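
One way the clustering step could be sketched in base R is hierarchical clustering on a correlation-based distance. Everything here is an assumption for illustration: the simulated data, the names a1..b3, the choice of 1 - |correlation| as distance, and k = 2. The automatic choice of a representative is only a stand-in for the expert judgment recommended above.

```r
set.seed(2)
n <- 500
# Hypothetical data: two latent signals, each seen through three noisy covariates.
z1 <- rnorm(n)
z2 <- rnorm(n)
x <- data.frame(
  a1 = z1 + rnorm(n, sd = 0.3),
  a2 = z1 + rnorm(n, sd = 0.3),
  a3 = z1 + rnorm(n, sd = 0.3),
  b1 = z2 + rnorm(n, sd = 0.3),
  b2 = z2 + rnorm(n, sd = 0.3),
  b3 = z2 + rnorm(n, sd = 0.3)
)

# Distance between covariates: 1 - |correlation|, so related covariates are close.
d <- as.dist(1 - abs(cor(x)))
tree <- hclust(d, method = "average")
cluster <- cutree(tree, k = 2)   # k is known here only because the data are simulated

# Stand-in for expert choice: the covariate most correlated with its cluster mates.
representatives <- sapply(split(names(cluster), cluster), function(vars) {
  r <- abs(cor(x[, vars, drop = FALSE]))
  vars[which.max(rowMeans(r))]
})
print(cluster)
print(representatives)
```

A full correlation matrix for tens of thousands of covariates is very large, so in a setting like the one discussed this would likely have to be run on blocks of variables or on a sample of records.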


At 05:34 AM 10/1/2008, Bernardo Rangel Tura wrote:

On Tue, 2008-09-30 at 18:56 -0500, Frank E Harrell Jr wrote:
> Bernardo Rangel Tura wrote:
> > On Sat, 2008-09-27 at 10:51 -0700, milicic.marko wrote:
> >> I have a huge dataset with thousands of variables and one binary
> >> variable. I know that most of the variables are correlated and are not
> >> good predictors... but...
> >>
> >> It is very hard to start modeling with such a huge dataset. What would
> >> be your suggestion? How to make a first cut... how to eliminate most
> >> of the variables but not to ignore potential interactions... for
> >> example, maybe variable A is not a good predictor and variable B is not
> >> a good predictor either, but maybe A and B together are a good
> >> predictor...
> >>
> >> Any suggestion is welcome
> >
> > milicic.marko
> >
> > I think you can start with rpart("binary variable"~.)
> > This shows you a set of variables to start a model with and a starting
> > set of cutoffs for continuous variables
>
> I cannot imagine a worse way to formulate a regression model. Reasons
> include
>
> 1. Results of recursive partitioning are not trustworthy unless the
> sample size exceeds 50,000 or the signal to noise ratio is extremely high.
>
> 2. The type I error of tests from the final regression model will be
> extraordinarily inflated.
>
> 3. False interactions will appear in the model.
>
> 4. The cutoffs so chosen will not replicate and in effect assume that
> covariate effects are discontinuous and piecewise flat. The use of
> cutoffs results in a huge loss of information and power and makes the
> analysis arbitrary and impossible to interpret (e.g., a high covariate
> value:low covariate value odds ratio or mean difference is a complex
> function of all the covariate values in the sample).
>
> 5. The model will not validate in new data.

Professor Frank,

Thank you for your explanation. Well, if my first idea is wrong, what is your opinion of the following approach?

1- Run PCA on the data, excluding the binary variable
2- Put the principal components in a logistic model
3- Afterwards, convert the principal components back to the original variables (only if this is of interest to milicic.marko)

If this approach is wrong too, what is your approach?

--
Bernardo Rangel Tura, M.D, MPH, Ph.D
National Institute of Cardiology
Brazil

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
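
For readers following the thread, the three-step PCA proposal in the quoted message could be sketched in base R roughly as follows. This illustrates the mechanics only and takes no position on whether the approach is advisable. The simulated data, the names x1..x10, and the choice of k = 2 components are assumptions made for the example.

```r
set.seed(3)
n <- 1000
# Hypothetical data: 10 correlated covariates driven by 2 latent factors.
f <- matrix(rnorm(n * 2), n, 2)
loadings <- matrix(runif(2 * 10, -1, 1), 2, 10)
x <- f %*% loadings + matrix(rnorm(n * 10, sd = 0.5), n, 10)
colnames(x) <- paste0("x", 1:10)
y <- rbinom(n, 1, plogis(f[, 1] - 0.5 * f[, 2]))   # binary outcome

# Step 1: PCA on the covariates only (outcome excluded), after standardizing.
pca <- prcomp(x, center = TRUE, scale. = TRUE)
k <- 2                                   # components kept; arbitrary in this sketch
scores <- as.data.frame(pca$x[, 1:k])

# Step 2: logistic regression on the retained component scores.
dat <- data.frame(y = y, scores)
fit <- glm(y ~ ., data = dat, family = binomial)

# Step 3: express the fit as coefficients on the standardized original covariates.
beta_std <- pca$rotation[, 1:k] %*% coef(fit)[-1]
print(round(drop(beta_std), 3))
```

The back-transformed coefficients apply to the centered and scaled covariates, so they reproduce the same linear predictor as the component model rather than giving a separate fit.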


================================================================
Robert A. LaBudde, PhD, PAS, Dpl. ACAFS  e-mail: [EMAIL PROTECTED]
Least Cost Formulations, Ltd.            URL: http://lcfltd.com/
824 Timberlake Drive                     Tel: 757-467-0954
Virginia Beach, VA 23464-3239            Fax: 757-467-2947

"Vere scire est per causas scire"
================================================================

