This may depend in part on how you're determining "significant" vs.
"not significant".  Your predictors are almost certainly correlated
to some degree;  so if you're looking at the collection of tests on
the regression coefficients at the end of a regression analysis, and
discarding all the variables for which the F (or t) values are "not
significant", you're almost certainly discarding more variables than
you "should", in some sense.

[Those standard tests ask, for each variable, "How much do you lose
if you omit this variable in the presence of all the others?".  If
you discard more than one variable from a superfluity of predictors,
you may be discarding several which collectively contribute usefully
(and "significantly") to your prediction, but each of which in the
presence of one or more others (in this set of variables just
discarded) does not contribute much of anything extra.  In the absence
of all those related variables, there may be one (or several) which
would now contribute something useful on its (or their) own.]
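
To make that concrete, here is a small simulated illustration (a
sketch in Python, using the numpy and statsmodels libraries; the data
and variable names are artificial, and the setup is one plausible
example, not a prescription):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    n = 100
    z = rng.normal(size=n)               # shared component
    x1 = z + 0.1 * rng.normal(size=n)    # x1 and x2 nearly collinear
    x2 = z + 0.1 * rng.normal(size=n)
    y = z + rng.normal(size=n)           # y depends on the shared part

    # Full model: each t-test asks what one variable adds given the
    # other, so both slopes will typically look "not significant".
    full = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
    print(full.pvalues[1:])

    # Joint F-test against the intercept-only model: the pair
    # together contributes very significantly.
    reduced = sm.OLS(y, np.ones(n)).fit()
    f_val, p_val, df_diff = full.compare_f_test(reduced)
    print(f_val, p_val)

Discarding both x1 and x2 because each t-value was small would throw
away essentially all of the predictive information in this example.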

However, if you're discarding variables after looking at them one by
one (perhaps on the basis of the "extra sum of squares" contributed
by each variable), the explanation above (for incorrectly identifying
some variables as "not significant") doesn't apply.
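
(With the simulated data above, that one-at-a-time "extra sum of
squares" could be computed as follows; again just a sketch:

    # Extra SS for x2 over a model already containing x1:
    m1  = sm.OLS(y, sm.add_constant(x1)).fit()
    m12 = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()
    extra_ss = m1.ssr - m12.ssr   # drop in residual SS from adding x2

The partial F-test built on that quantity is just the square of the
t-test on x2 in the two-variable model.)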

If my conjecture does describe your situation, however, one approach
is to orthogonalize the predictors with respect to each other.
This is easiest to make sense of if the variables have some inherent
order (of "importance", or "preference", e.g.):  one would wish to
orthogonalize the later (less important, less preferred) variables
with respect to the earlier ones.  For details, see Draper and Smith,
"Applied Regression Analysis" (Wiley), 2nd edition or later, section
(in Chapter 4 or 5 as I recall) on "orthogonalizing the X matrix".
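In case it helps, here is a bare-bones version of that sequential
orthogonalization (my own sketch in Python with numpy, not Draper &
Smith's code; the function name and the ordering convention are
mine):

    import numpy as np

    def orthogonalize(X):
        """Sequentially orthogonalize the columns of X, assumed to be
        ordered from most to least important.  Column j is replaced
        by its residuals from a regression (with intercept) on the
        columns before it, so it carries only what it adds beyond
        them."""
        X = np.asarray(X, dtype=float)
        n, p = X.shape
        Z = np.empty_like(X)
        Z[:, 0] = X[:, 0] - X[:, 0].mean()   # just center the first
        for j in range(1, p):
            A = np.column_stack([np.ones(n), Z[:, :j]])
            beta, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
            Z[:, j] = X[:, j] - A @ beta     # keep the residual part
        return Z

Regressing y on the columns of Z, the test on each coefficient now
reflects what that variable adds over the variables preceding it,
rather than over all the others at once.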

Also, and especially if your predictors include interaction effects
among your raw predictors, see my White Paper on the Minitab web
site at www.minitab.com, on "Modelling and interpreting interactions
in multiple regression".
                                 -- DFB.
 -----------------------------------------------------------------------
 Donald F. Burrill                                            [EMAIL PROTECTED]
 56 Sebbins Pond Drive, Bedford, NH 03110                 (603) 626-0816

 ================= Original message: =====================
Date: 18 Feb 2003 11:41:56 -0800
From: punky brewster <[EMAIL PROTECTED]>

How important is statistical significance of coefficients in
"predictive modeling"?

That is, let's say one is attempting to predict response to a
marketing campaign using a logistic regression, and produces two
models.  The first model predicts 76% of cases correctly, and has some
coefficients that are statistically significant and a number that are
not statistically significant.  The second model, on the other hand,
contains only variables that are statistically significant, but
predicts only 61% of cases correctly.

For the purposes of prediction only -- one does not care at all about
hypothesis testing for any of the coefficients in the model -- which is
a "better" model and why?

Furthermore, where can I read up to get a better grasp on this?
