Karen Scheltema writes:

> I know about the perils of stepwise, and I agree with you that
> it is a less than desirable procedure. This researcher,
> however, is not as convinced as I am about not doing stepwise. Sigh.
> He has more variables than would comfortably fit a 5-1 case to
> variable ratio for a forced entry regression, which is why he was
> hoping stepwise would help him narrow his model. Any suggestions I
> can give him, short of telling him to scrap everything?

The five to one ratio (actually, I've heard ten to one or fifteen to one)
refers to the number of candidate variables and not the number of variables
in the final stepwise model. The original papers behind these ratios
developed them to see how well stepwise regression would do. And stepwise
regression is highly unstable and often selects the wrong variables when the
number of variables going into stepwise (not the number coming out) is large
relative to the number of observations.
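
A small simulation can make the point about candidate variables concrete.
This is only a sketch under assumptions of my own choosing: it uses NumPy, the
outcome and every candidate are pure noise, and it looks only at the first
step of a forward selection (pick the candidate with the largest R-squared).
The sample size of 50 and the function names are illustrative, not taken from
the researcher's data.

```python
import numpy as np

def best_noise_r2(n_cases, n_candidates, rng):
    """Largest single-predictor R^2 when y and all candidates are pure noise."""
    X = rng.standard_normal((n_cases, n_candidates))
    y = rng.standard_normal(n_cases)
    # Squared Pearson correlation of each candidate column with y.
    r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(n_candidates)])
    return float(np.max(r ** 2))

def mean_best_r2(n_cases, n_candidates, n_sims=500, seed=0):
    """Average, over simulated data sets, of the winning candidate's R^2."""
    rng = np.random.default_rng(seed)
    sims = [best_noise_r2(n_cases, n_candidates, rng) for _ in range(n_sims)]
    return float(np.mean(sims))

# Same number of cases, different numbers of candidates going in.
one_candidate = mean_best_r2(n_cases=50, n_candidates=1)
ten_candidates = mean_best_r2(n_cases=50, n_candidates=10)

print(f"mean best R^2, 1 candidate:   {one_candidate:.3f}")
print(f"mean best R^2, 10 candidates: {ten_candidates:.3f}")
```

The "winner" among ten noise variables looks noticeably better than a single
noise variable does, even though none of them is related to the outcome. What
drives this is the number of variables going in, not the number coming out.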

So stepwise does not solve anything. The only real solution is to eliminate
certain variables a priori using medical and scientific criteria.

Along with the relative lack of data, you also have high VIFs. Both of these
indicate that the model is unstable and will be unlikely to replicate well
with a different data set. A high VIF is actually another indication of a
relative lack of data. His independent variables do not effectively fill up
the k-dimensional space but instead fall close to a lower dimensional
subspace. In simple terms, some of the corners in his data space are
empty.
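
For reference, here is a minimal sketch of how a VIF can be computed, assuming
NumPy and the usual definition VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from
regressing predictor j on all the other predictors. The function name `vif`
and the synthetic data are illustrative assumptions, not the researcher's
variables.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of the predictor matrix X."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    factors = []
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress predictor j on an intercept plus the remaining predictors.
        design = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1.0 - resid.var() / target.var()
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

# Synthetic example: x2 is nearly a copy of x1, x3 is unrelated to both.
rng = np.random.default_rng(0)
x1 = rng.standard_normal(200)
x2 = x1 + 0.1 * rng.standard_normal(200)
x3 = rng.standard_normal(200)

vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs)
```

In this made-up example, x1 and x2 lie close to a line, so the plane they are
supposed to span is mostly empty and their VIFs are large. The unrelated x3
has a VIF near 1.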

That's actually good news in a way. It means that his data set is good for
generating hypotheses but not for confirming hypotheses. Try to get him to
focus on exploratory models--draw lots of graphs and use words like
"suggestive of a trend". Don't pretend that the confidence intervals and
p-values are proving a whole lot. Try to include as few confidence intervals
and p-values as possible in the final publication.

Don't focus on a single model. A series of single variable regression models
may be more informative than a single multiple variable regression model.
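
A series of single variable fits could be sketched as follows. This assumes
NumPy, and the predictor names and synthetic data are hypothetical
placeholders. The output is meant to be read descriptively (slopes and
R-squared values next to the corresponding scatterplots), not as a set of
tests.

```python
import numpy as np

def single_variable_fits(X, y, names):
    """Fit y on each predictor separately; return slope, intercept and R^2."""
    results = []
    for j, name in enumerate(names):
        x = X[:, j]
        # np.polyfit with deg=1 returns [slope, intercept].
        slope, intercept = np.polyfit(x, y, deg=1)
        r = np.corrcoef(x, y)[0, 1]
        results.append({
            "name": name,
            "slope": float(slope),
            "intercept": float(intercept),
            "r2": float(r ** 2),
        })
    return results

# Synthetic data: only the first (hypothetical) predictor is related to y.
rng = np.random.default_rng(1)
names = ["var_a", "var_b", "var_c"]
X = rng.standard_normal((100, 3))
y = 2.0 * X[:, 0] + rng.standard_normal(100)

fits = single_variable_fits(X, y, names)
for fit in fits:
    print(f"{fit['name']}: slope={fit['slope']:.2f}, R^2={fit['r2']:.2f}")
```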

Don't worry whether you have the "right" model or not. Your model is almost
certainly wrong. That's liberating. If all approaches are likely to yield
the wrong results, then you can't be faulted for using (or not using) any
particular approach. Just use any reasonable approach and if you put in a
lot of caveats ("further study with a larger data set should be done") then
you should be okay.

It's a rare data set that should be totally scrapped. It may only provide
weak evidence, but it still helps point future researchers in the right
direction. The only sin here is to pretend that this data is definitive or
the final word.

Good luck!

Steve Simon, [EMAIL PROTECTED], Standard Disclaimer.
The STATS web page has moved to
http://www.childrens-mercy.org/stats.

P.S. There are a bunch of new methods that can handle model selection better
than stepwise, but these might be overkill for your application. I'm just
starting to look at these approaches (the lasso, bagging, and boosting) so I
can't say anything other than they have cute names.