At 08:24 AM 7/24/2003 -0400, Peter Flom wrote:
<<< (1) I've never liked this approach for a model with a constant, where it makes more sense to me to centre the data. I realize that opinions differ here, but it seems to me that failing to centre the data conflates collinearity with numerical instability. >>>
Opinions do differ. A few years ago, I could have given more details (my dissertation was on this topic, but a lot of the details have disappeared from memory); I think, though, that Belsley is looking for a measure that deals not only with collinearity, but with several other problems, including numerical instability (the subtitle of his later book is Collinearity and Weak Data in Regression). I remember being convinced that centering was generally not a good idea, but there are lots of people who disagree and who know a lot more statistics than I do.
To elaborate my remark slightly, in most problems the intercept is not of much interest. When the data are far from the origin, it's natural that the intercept isn't well estimated. When data are very far from the origin, computations with the uncentred data may be numerically unstable (depending upon how the computations are done) because of "collinearity with the intercept." If the real interest is in the coefficients other than the intercept, this seems to me purely a numerical artefact. The possibly more generally interesting sense of "collinearity" is imprecision in estimation due to strong relationships among the predictors.
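A small simulation may make the distinction concrete. This is only a sketch on made-up data: the predictor x, the response y, and the helper cond_number() are hypothetical, not taken from Belsley's book or from any posted function. It scales the columns of the model matrix to unit length, as Belsley's diagnostics do, and compares the condition number before and after centring.

```r
# Hypothetical illustration: one predictor located far from the origin.
set.seed(123)
x <- 1000 + rnorm(50)
y <- 2 + 0.5 * x + rnorm(50)

# Scaled condition number: give each column of the model matrix unit
# length, then take the ratio of the largest to smallest singular value.
cond_number <- function(X) {
  X <- sweep(X, 2, sqrt(colSums(X^2)), "/")
  d <- svd(X)$d
  max(d) / min(d)
}

xc <- x - mean(x)  # centred predictor

cond_raw     <- cond_number(cbind(1, x))   # x is nearly proportional to the constant
cond_centred <- cond_number(cbind(1, xc))  # xc is orthogonal to the constant

# The slope standard error does not depend on where the origin is.
se_raw     <- coef(summary(lm(y ~ x)))["x", "Std. Error"]
se_centred <- coef(summary(lm(y ~ xc)))["xc", "Std. Error"]

print(c(cond_raw = cond_raw, cond_centred = cond_centred))
print(c(se_raw = se_raw, se_centred = se_centred))
```

Here cond_raw should be large (roughly 2 * mean(x) / sd(x)) while cond_centred is 1, and the two slope standard errors should agree to within rounding. Only the intercept is estimated less precisely in the uncentred fit.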
. . .
<<< (4) I have doubts about the whole enterprise. Collinearity is one source of imprecision -- others are small sample size, homogeneous predictors, and large error variance. Aren't the coefficient standard errors the bottom line? If these are sufficiently small, why worry? >>>
I think (correct me if I am wrong) that the s.e.s and the condition indices serve very different purposes. The condition indices are supposed to determine whether small changes in the input data could make big differences in the results. Belsley provides some examples where a tiny change in the data produces completely different results (e.g., different standard errors and different coefficients, even with reversed signs).
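The kind of sensitivity described here can be sketched with simulated data. Everything in the block below is an assumption made for illustration (the variables x1, x2, y and the size of the perturbation are hypothetical, and this is not one of Belsley's examples): two nearly identical predictors are fitted, then one of them is nudged by about 0.01 and the model is refitted.

```r
# Hypothetical simulated data: x2 is almost a copy of x1.
set.seed(42)
n  <- 10
x1 <- as.numeric(1:n)
x2 <- x1 + 0.01 * rnorm(n)
y  <- x1 + x2 + rnorm(n)

# Nudge x2 by amounts on the order of 0.01, i.e. small relative to
# the scale of the data (values between 1 and 10).
x2_nudged <- x2 + 0.01 * rnorm(n)

fit_original <- lm(y ~ x1 + x2)
fit_nudged   <- lm(y ~ x1 + x2_nudged)

# Coefficients side by side; with predictors this collinear the x1 and
# x2 estimates can move a great deal, possibly changing sign.
print(cbind(original = coef(fit_original), nudged = coef(fit_nudged)))

# Scaled condition number of the original model matrix
# (columns scaled to unit length before taking singular values).
X <- model.matrix(fit_original)
X <- sweep(X, 2, sqrt(colSums(X^2)), "/")
d <- svd(X)$d
cond <- max(d) / min(d)
print(cond)
```

The large condition number flags that the fit may react strongly to perturbations of this size; how much the coefficients actually move depends on the particular draw.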
Indeed, ill-conditioned data produce unstable numerical solutions (which can even be affected by how the data are rounded), but condition indices aren't a particularly effective way of looking for instability in a more general sense. Consider, for example, Anscombe's famous simple-regression examples, which are in the data frame Quartet in the car package. The fourth example has a highly influential data point (number 8):
> Quartet[, c("x4", "y4")]
   x4    y4
1   8  6.58
2   8  5.76
3   8  7.71
4   8  8.84
5   8  8.47
6   8  7.04
7   8  5.25
8  19 12.50
9   8  5.56
10  8  7.91
11  8  6.89
The regression of y4 on x4 isn't especially ill-conditioned (using the function I posted yesterday):
> mod <- lm(y4 ~ x4, data = Quartet)
> belsley(mod)

Singular values:  1.394079 0.2377891
Condition indices:  1 5.86267

Variance-decomposition proportions
  (Intercept)    x4
1       0.028 0.028
2       0.972 0.972
but the 8th observation has an infinite Cook's D:
> round(cooks.distance(mod), 2)
   1    2    3    4    5    6    7    8    9   10   11
0.01 0.06 0.02 0.14 0.09 0.00 0.12  Inf 0.08 0.03 0.00
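For readers without the posted function, the numbers above can be approximated from first principles. The following is a rough sketch, not the belsley() function itself: it assumes that the base-R data frame anscombe holds the same x4 and y4 values as Quartet, and that the diagnostics are computed from the singular value decomposition of the model matrix after scaling each column to unit length. It also shows why observation 8 is so influential: it is the only case with x4 = 19, so its hat value is 1.

```r
# Assumption: datasets::anscombe contains the same x4, y4 as car's Quartet.
mod <- lm(y4 ~ x4, data = anscombe)

# Scale model-matrix columns to unit length, then take the SVD.
X  <- model.matrix(mod)
Xs <- sweep(X, 2, sqrt(colSums(X^2)), "/")
sv <- svd(Xs)

singular_values   <- sv$d
condition_indices <- max(sv$d) / sv$d

# Variance-decomposition proportions:
# phi[k, j] = v[k, j]^2 / d[j]^2, normalised over j for each coefficient k.
phi   <- sweep(sv$v^2, 2, sv$d^2, "/")
props <- t(phi / rowSums(phi))   # rows: singular values, columns: coefficients
colnames(props) <- colnames(X)

print(singular_values)
print(condition_indices)
print(round(props, 3))

# Leverage: the fitted line is forced through observation 8.
h <- hatvalues(mod)
print(round(h, 3))
```

With a hat value of 1 the residual for observation 8 is zero and the denominator of Cook's D vanishes, which is why the statistic is not finite there (the exact value printed, Inf or NaN, may depend on the version of R).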
Regards,
 John

-----------------------------------------------------
John Fox
Department of Sociology
McMaster University
Hamilton, Ontario, Canada L8S 4M4
email: [EMAIL PROTECTED]
phone: 905-525-9140x23604
web: www.socsci.mcmaster.ca/jfox
