Re: [R] Condition indexes and variance inflation factors

John Fox Thu, 24 Jul 2003 07:01:08 -0700

Dear Peter,

At 08:24 AM 7/24/2003 -0400, Peter Flom wrote:

(1) I've never liked this approach for a model with a constant, where it
makes more sense to me to centre the data. I realize that opinions differ
here, but it seems to me that failing to centre the data conflates
collinearity with numerical instability.
>>>

Opinions do differ.  A few years ago, I could have given more details
(my dissertation was on this topic, but a lot of the details have
disappeared from memory); I think, though, that Belsley is looking for a
measure that deals not only with collinearity, but with several other
problems, including numerical instability (the subtitle of his later
book is Collinearity and Weak Data in Regression).  I remember being
convinced that centering was generally not a good idea, but there are
lots of people who disagree and who know a lot more statistics than I
do.

To elaborate my remark slightly, in most problems the intercept is not of much interest. When the data are far from the origin, it's natural that the intercept isn't well estimated. When data are very far from the origin, computations with the uncentred data may be numerically unstable (depending upon how the computations are done) because of "collinearity with the intercept." If the real interest is in the coefficients other than the intercept, this seems to me purely a numerical artefact. The possibly more generally interesting sense of "collinearity" is imprecision in estimation due to strong relationships among the predictors.
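
A rough sketch of the point about "collinearity with the intercept" follows. It assumes made-up data (a single predictor with values near 1000) and a small helper, cond.number(), which is hypothetical and not part of any package. The helper scales the columns of the model matrix to unit length and takes the ratio of the largest to the smallest singular value, in the spirit of Belsley's scaled condition indices.

```r
# Hypothetical data: one predictor located far from the origin.
set.seed(123)
x <- 1000 + rnorm(50)
y <- 2 + 0.5 * x + rnorm(50)

# Illustrative helper (an assumption, not a library function):
# scale columns to unit length, then take max/min singular value.
cond.number <- function(X) {
    X <- sweep(X, 2, sqrt(colSums(X^2)), "/")
    d <- svd(X)$d
    max(d) / min(d)
}

X.raw <- cbind(1, x)             # constant plus uncentred predictor
X.cen <- cbind(1, x - mean(x))   # constant plus centred predictor

cond.number(X.raw)   # large: x is nearly proportional to the constant
cond.number(X.cen)   # essentially 1: the two columns are orthogonal

# The slope and its standard error agree in the two fits;
# only the intercept and its standard error change.
m.raw <- lm(y ~ x)
m.cen <- lm(y ~ I(x - mean(x)))
coef(summary(m.raw))
coef(summary(m.cen))
```

With the uncentred data the condition number is large even though there is only one predictor and hence no relationship among predictors to speak of, while the precision of the slope estimate is unaffected by centring.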

. . .

<<<
(4) I have doubts about the whole enterprise. Collinearity is one source of
imprecision -- others are small sample size, homogeneous predictors, and
large error variance. Aren't the coefficient standard errors the bottom
line? If these are sufficiently small, why worry?
>>>

I think (correct me if I am wrong) that the s.e.s and the condition
indices serve very different purposes.  The condition indices are
supposed to determine if small changes in the input data could make big
differences in the results.  Belsley provides some examples where a tiny
change in the data results in completely different results (e.g.,
different standard errors, different coefficients (even reversing sign)
and so on).

Indeed, ill-conditioned data produce unstable numerical solutions (even affected by how the data are rounded), but condition indices aren't a particularly effective way of looking for instability in a more general sense. Consider, for example, Anscombe's famous simple-regression examples, which are in the data frame Quartet in the car package. The fourth example has a highly influential data point (number 8):


> Quartet[, c("x4", "y4")]
   x4    y4
1   8  6.58
2   8  5.76
3   8  7.71
4   8  8.84
5   8  8.47
6   8  7.04
7   8  5.25
8  19 12.50
9   8  5.56
10  8  7.91
11  8  6.89

The regression of y4 on x4 isn't especially ill-conditioned (using the function I posted yesterday):

> mod <- lm(y4 ~ x4, data=Quartet)
> belsley(mod)

Singular values:  1.394079 0.2377891
Condition indices:  1 5.86267

Variance-decomposition proportions
  (Intercept)    x4
1       0.028 0.028
2       0.972 0.972

but the 8th observation has an infinite Cook's D:

> round(cooks.distance(mod), 2)
   1    2    3    4    5    6    7    8    9   10   11
0.01 0.06 0.02 0.14 0.09 0.00 0.12  Inf 0.08 0.03 0.00
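
For anyone who wants to check this without the car package or the belsley() function, here is a minimal sketch. It assumes that the anscombe data frame in the base datasets package holds the same x4 and y4 values as Quartet, and it computes the condition indices directly from the singular values of the column-scaled model matrix, which is my reading of the calculation rather than a copy of the posted function. The name mod4 is just for illustration.

```r
# 'anscombe' ships with base R (package datasets); columns x4 and y4
# are assumed here to match Quartet$x4 and Quartet$y4.
mod4 <- lm(y4 ~ x4, data = anscombe)

# Condition indices: scale each column of the model matrix to unit
# length, take the singular values, divide the largest by each one.
X <- model.matrix(mod4)
X <- sweep(X, 2, sqrt(colSums(X^2)), "/")
d <- svd(X)$d
d          # singular values
d[1] / d   # condition indices

# Hat values (leverages): observation 8 is the only case with
# x4 different from 8, so it alone determines the slope.
h <- hatvalues(mod4)
round(h, 3)
```

The leverage of observation 8 is 1 (the fitted line passes through that point), which is what the condition indices give no hint of.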


Regards,
 John
-----------------------------------------------------
John Fox
Department of Sociology
McMaster University
Hamilton, Ontario, Canada L8S 4M4
email: [EMAIL PROTECTED]
phone: 905-525-9140x23604
web: www.socsci.mcmaster.ca/jfox

______________________________________________
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help
