I wonder whether any technique in statistics has failed so
often, in so many applications, as the replication of
results from stepwise regression.

On 6 Oct 2003 07:16:04 -0700, [EMAIL PROTECTED] (Bastian) wrote:

> Hello,
> 
> there are lots of suggestions for the minimum data for regression
> models, i.e.
> 
> 1) 1 var. for every 10 observations.
> 2) Variables can be added until the adjusted R-square deviates
>    substantially from the unadjusted R-square.
> 3) With relatively large samples (n=100), a variable can be added
>    if its correlation with the other variables is no larger than
>    about 0.80 or 0.85.
> 
> My question is whether suggestion 1) should be considered with a
> stepwise regression, too. That is, when I do a stepwise regression
> and end up with, say, 20 variables, are 200 observed values
> sufficient for prediction, or is it important to have as many
> observations as there are "possible" variables before using the
> stepwise method?

You want to look further, and find advice that is a lot more
complete than what you cite.  At first pass, all those things 
are moderately irrelevant or wrong, though they do make 
sense in the larger picture of model-building.

You can ignore (1)  because stepwise is almost always such a
rotten idea that you can't save it with a moderately large N.
A huge N can save it, maybe, so long as the p-levels for entry
are set to something absurdly tiny.
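To see why a strict entry criterion matters, here is a hypothetical
sketch (mine, not part of the original exchange) of the screening
step of a stepwise pass run on predictors that are *pure noise*.
All names and the cutoffs (1.97 ~ p<.05, 3.34 ~ p<.001 for ~200 df)
are my own assumptions:

```python
import random, math

random.seed(0)

def pearson_r(x, y):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return sxy / (sx * sy)

def t_stat(r, n):
    """t statistic for testing r = 0 with n observations."""
    return r * math.sqrt((n - 2) / (1 - r * r))

n, k = 200, 20                      # observations; pure-noise candidates
y = [random.gauss(0, 1) for _ in range(n)]
X = [[random.gauss(0, 1) for _ in range(n)] for _ in range(k)]

# Entry screening of a stepwise pass: keep candidates whose |t|
# exceeds the cutoff, even though *none* of them relate to y.
lenient = [j for j in range(k) if abs(t_stat(pearson_r(X[j], y), n)) > 1.97]
strict  = [j for j in range(k) if abs(t_stat(pearson_r(X[j], y), n)) > 3.34]

print("noise variables passing p<.05 :", len(lenient))
print("noise variables passing p<.001:", len(strict))
```

With 20 noise candidates at p<.05 you expect about one false entry
per pass; the absurdly tiny threshold keeps the garbage out.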

You can discount (2)  because it does nothing to compensate for
capitalizing on chance, which is built in when you start with a
*bunch*  of candidate variables.
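The chance-capitalization is easy to simulate. Here is a toy
illustration of my own (not from the original post): with a modest
sample and many pure-noise candidates, the "winning" correlation
looks respectable even though every candidate is noise.

```python
import random

random.seed(1)

def pearson_r(x, y):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return sxy / (sx * sy)

n, k = 50, 30                       # modest sample, many noise candidates
y = [random.gauss(0, 1) for _ in range(n)]

abs_rs = []
for _ in range(k):
    x = [random.gauss(0, 1) for _ in range(n)]   # pure noise predictor
    abs_rs.append(abs(pearson_r(x, y)))

typical = sum(abs_rs) / k           # what an average noise |r| looks like
best = max(abs_rs)                  # what stepwise would "find" first

print("typical |r| of a noise variable: %.2f" % typical)
print("|r| of the selected variable   : %.2f" % best)
```

The selected maximum is what stepwise reports, and its nominal
p-value never accounts for the other 29 candidates it beat.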

For (3):  a variable that is highly correlated with another can
still offer new information, precisely through the *difference*
between it and that other variable -- that is my experience in
biomedical settings.  And (3)  could be entirely silly and wrong
in some environment  where  .80  is not considered a large
correlation.
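Here is a constructed example of that point (my own illustration,
with made-up parameters): x2 correlates about .85 with x1, so rule
(3) would bar it, yet the *difference* between them is exactly what
predicts y.

```python
import random, math

random.seed(2)

n = 500
x1 = [random.gauss(0, 1) for _ in range(n)]
d  = [random.gauss(0, 0.6) for _ in range(n)]   # the "difference" part
x2 = [a + b for a, b in zip(x1, d)]             # r(x1, x2) near .85
y  = [b + random.gauss(0, 1) for b in d]        # y driven by d, not x1

def r2_simple(x, y):
    """R^2 from least-squares fit of y on one predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return (sxy * sxy) / (sxx * syy)

def r2_pair(u, v, y):
    """R^2 from least squares on two predictors (2x2 normal equations)."""
    n = len(y)
    mu, mv, my = sum(u) / n, sum(v) / n, sum(y) / n
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suy = sum((a - mu) * (c - my) for a, c in zip(u, y))
    svy = sum((b - mv) * (c - my) for b, c in zip(v, y))
    syy = sum((c - my) ** 2 for c in y)
    det = suu * svv - suv * suv
    b1 = (svv * suy - suv * svy) / det
    b2 = (suu * svy - suv * suy) / det
    return (b1 * suy + b2 * svy) / syy

r12 = math.sqrt(r2_simple(x1, x2))              # both loadings positive here
print("r(x1, x2)          : %.2f" % r12)
print("R^2 of y ~ x2      : %.2f" % r2_simple(x2, y))
print("R^2 of y ~ x1 + x2 : %.2f" % r2_pair(x1, x2, y))
```

Alone, x2 predicts y weakly; paired with x1, the fit recovers the
difference component and the R^2 jumps -- which is exactly the
information rule (3) would throw away.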

Concerning the errors of doing stepwise:
You can find discussions by various people, which I saved to my
stats-FAQ  a few years ago.  Or search the stats groups with
www.groups.google  to get similar comments that are more recent.

Replication seems to be the big key for variable selection.
Interior cross-validation is a necessity, and that means you
need a large enough  N  that your effects will be meaningful
(probably larger than the minimum that would be 'statistically
significant')  even in subsamples.
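A bare-bones version of that interior check (my own sketch, with
invented data and a simple marginal-screening step standing in for
full stepwise): select on one half of the sample, re-select on the
held-out half, and keep only what replicates.

```python
import random, math

random.seed(3)

def pearson_r(x, y):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return sxy / (sx * sy)

def t_stat(r, n):
    """t statistic for testing r = 0 with n observations."""
    return r * math.sqrt((n - 2) / (1 - r * r))

n, k = 400, 15
X = [[random.gauss(0, 1) for _ in range(n)] for _ in range(k)]
# Only X[0] actually relates to y; the other 14 candidates are noise.
y = [0.5 * X[0][i] + random.gauss(0, 1) for i in range(n)]

def screened(rows, cutoff=1.97):    # cutoff ~ p < .05
    """Return the candidate indices that pass screening on these rows."""
    kept = set()
    for j in range(k):
        xs = [X[j][i] for i in rows]
        ys = [y[i] for i in rows]
        if abs(t_stat(pearson_r(xs, ys), len(rows))) > cutoff:
            kept.add(j)
    return kept

half = n // 2
sel_a = screened(range(half))       # select on the first half
sel_b = screened(range(half, n))    # re-select on the held-out half

print("selected in half A:", sorted(sel_a))
print("selected in half B:", sorted(sel_b))
print("replicated        :", sorted(sel_a & sel_b))
```

A real effect big enough to matter survives in both halves; a
noise variable has to get lucky twice, which is rare. This is also
why N must be large: a true effect that is only borderline in the
full sample will routinely vanish in a half-sample.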

Selecting among variables is a big topic, and there is always
interest in it, but the  roles of the separate measures do matter;
and the  *reasons*  for selecting are important.  
Do you want a shorter equation?  
Are you seeking ontological evidence?
Are you scanning 4000 genes to see which one might
associate with  < something > ?

Hope this helps.
-- 
Rich Ulrich, [EMAIL PROTECTED]
http://www.pitt.edu/~wpilib/index.html
"Taxes are the price we pay for civilization." 
=================================================================
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at:
.                  http://jse.stat.ncsu.edu/                    .
=================================================================
