I wonder whether any technique in statistics has failed so often, in so many applications, as the replication of results from stepwise regression.
On 6 Oct 2003 07:16:04 -0700, [EMAIL PROTECTED] (Bastian) wrote:

> Hello,
>
> there are lots of suggestions for the minimum data for regression
> models, e.g.
>
> 1) 1 var. for every 10 observations.
> 2) Variables can be added until adjusted R-square deviates
>    substantially from unadjusted R-square.
> 3) With relatively large samples (n=100), a variable can be added
>    if its correlation with other variables is no larger than about
>    0.80 or 0.85.
>
> My question is whether suggestion 1) should be considered with a
> stepwise regression, too. That is, when I do a stepwise regression
> and get, say, 20 variables, are 200 observed values sufficient for
> prediction, or is it important to have as many observations as
> there are "possible" variables before using the stepwise method?

You want to look further, and find advice that is a lot more complete
than what you cite. At first pass, all those things are moderately
irrelevant or wrong, though they do make sense in the larger picture
of model-building.

You can ignore (1) because stepwise is almost always such a rotten
idea; you can't save it with moderately large N. Huge N can save it,
maybe, so long as the p-levels are something absurdly tiny.

You can discount (2) because it does nothing to compensate for
over-capitalizing on chance, by virtue of starting with a *bunch* of
variables -- the first sketch below shows how badly that goes even
with pure noise.

For (3): A variable with high correlation offers, especially, the
chance of new information by virtue of the *difference* between it
and another variable -- this is my experience in biomedical
circumstances; the second sketch below gives a small numerical
example. (3) could also be entirely silly and wrong in some
environment where .80 is not considered a large correlation.

Concerning the errors of doing stepwise: You can find discussions by
various people, which I saved to my stats-FAQ a few years ago. Or
search the stats groups with www.groups.google to get (similar)
comments that are more recent.

Replication seems to be the big key for variable selection. Interior
cross-validation is a necessity, and that means you need N large
enough that your effects will be meaningful (probably larger than the
minimum that would be 'statistically significant') even in
subsamples; the fresh-sample check at the end of the first sketch
below is exactly that sort of test.

Selecting among variables is a big topic, and there is always
interest in it, but the roles of the separate measures do matter, and
the *reasons* for selecting are important. Do you want a shorter
equation? Are you seeking ontological evidence? Are you scanning 4000
genes to see which one might associate with <something>?

Hope this helps.

--
Rich Ulrich, [EMAIL PROTECTED]
http://www.pitt.edu/~wpilib/index.html
"Taxes are the price we pay for civilization."
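
First sketch: a rough Python illustration of the over-capitalizing
problem. This is only my demo -- the sample sizes, the p < .05 entry
rule, and the forward_stepwise helper are all made-up assumptions,
not anyone's recommended procedure. Every predictor is pure noise,
yet stepwise entry still admits variables, and the in-sample R^2
falls apart on a fresh sample.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n, p = 200, 50                       # 200 observations, 50 candidates
    X = rng.standard_normal((n, p))      # predictors: pure noise
    y = rng.standard_normal(n)           # outcome: unrelated noise

    def forward_stepwise(X, y, alpha=0.05):
        """Greedy forward selection: at each step, add the candidate
        with the smallest p-value; stop when none enters below alpha."""
        selected = []
        remaining = list(range(X.shape[1]))
        while remaining:
            pvals = {}
            for j in remaining:
                cols = selected + [j]
                fit = sm.OLS(y, sm.add_constant(X[:, cols])).fit()
                pvals[j] = fit.pvalues[-1]   # p-value of candidate j
            best = min(pvals, key=pvals.get)
            if pvals[best] >= alpha:
                break
            selected.append(best)
            remaining.remove(best)
        return selected

    chosen = forward_stepwise(X, y)
    fit = sm.OLS(y, sm.add_constant(X[:, chosen])).fit()
    print("selected from pure noise:", chosen)       # typically several
    print("in-sample R^2:", round(fit.rsquared, 3))

    # The replication check: score the selected equation on a fresh
    # sample drawn from the same (null) process.
    X_new = rng.standard_normal((n, p))
    y_new = rng.standard_normal(n)
    pred = sm.add_constant(X_new[:, chosen]) @ fit.params
    ss_res = np.sum((y_new - pred) ** 2)
    ss_tot = np.sum((y_new - y_new.mean()) ** 2)
    print("fresh-sample R^2:", round(1 - ss_res / ss_tot, 3))  # ~0 or below

That last number is the interior replication I mean above: if the
selected equation cannot predict a fresh subsample, the selection did
not find anything.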
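Second sketch: a small self-contained demo of the point about (3),
again with numbers I invented for the purpose, not any real data set.
Here y depends on the *difference* x1 - x2, while corr(x1, x2) is
near .90. Either variable alone looks nearly useless; the pair fits
well. So a correlation of .80 or .85 between predictors is not, by
itself, a reason to exclude one.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    n = 200
    z = rng.standard_normal(n)                  # shared component
    x1 = z + 0.3 * rng.standard_normal(n)
    x2 = z + 0.3 * rng.standard_normal(n)       # corr(x1, x2) ~ .90
    y = (x1 - x2) + 0.5 * rng.standard_normal(n)

    print("corr(x1, x2):", round(np.corrcoef(x1, x2)[0, 1], 2))
    for label, cols in [("x1 alone ", [x1]), ("x2 alone ", [x2]),
                        ("x1 and x2", [x1, x2])]:
        Xd = sm.add_constant(np.column_stack(cols))
        r2 = sm.OLS(y, Xd).fit().rsquared
        print(label, "R^2:", round(r2, 2))      # roughly .02, .02, .40

This is the classic suppressor situation; in biomedical data, the
difference between two highly correlated measures is often exactly
the interesting signal.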
