I just wrote an S-plus function that computes the statistical measures described in "Inferring Formal Causation from Corresponding Regressions" by William V. Chambers (http://www.wynja.com/chambers/regression.html), but please check my logic. Here is the relevant language from the publication.
> The answer follows from the logic of least squares
> regression. The prediction slope when predicting x1 from y
> will tend to be oriented by the extremes of y. An extreme
> value can lead to a relatively extreme error, especially
> since least squares analysis weights errors by squaring
> them. Thus, in order to reduce the overall error, the
> extremes of y will orient the prediction slope when
> predicting x1 from y. Consequently, absolute errors
> predicting x1 from y should correlate negatively with the
> absolute values of the deviations of the y values from
> their mean, since absolute deviations reflect extremity.
> Error will not be asymmetrically reduced across the levels
> of x1 when x1 serves as the predictor of y. This is because
> the correlation between x1 and error (x2) remains uniform
> across both extremes and mid-ranges of x variables.
> Therefore, rde(y), the correlation of the absolute errors
> predicting x1 from y, should be more negative than rde(x),
> the correlation between absolute deviations from x1 and
> absolute errors predicting y. This difference in the
> correlations of predictor extremities and regression
> residuals is a reflection of the asymmetrical formal
> relationships between IVs and DVs and is the basis of the
> method of corresponding regressions.
> A summary statistic, D, was found by subtracting rde(x)
> from rde(y).
This is the S-plus function:
corr.reg <- function(x, y)
{
        # residuals from predicting y from x, and from predicting x from y
        ey.x <- resid(lm(y ~ x))
        ex.y <- resid(lm(x ~ y))
        # rde(y): correlation of |errors predicting x from y| with |deviations of y from its mean|
        rde.y <- cor(abs(ex.y), abs(y - mean(y)))
        # rde(x): correlation of |errors predicting y from x| with |deviations of x from its mean|
        rde.x <- cor(abs(ey.x), abs(x - mean(x)))
        # D = rde(y) - rde(x), the summary statistic in the paper
        rde.y - rde.x
}
Even if you don't use S-plus, you should be able to follow this. The syntax lm(y ~ x) fits a linear regression model with y as the dependent variable and x as the predictor, resid() extracts the residuals from the fit, and cor() computes the Pearson correlation.
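To make the logic easier to check, it may help to look at the two component correlations separately rather than just their difference. A variant along these lines (the name corr.reg2 is just mine, for illustration) would return rde(y), rde(x), and D:
corr.reg2 <- function(x, y)
{
        # errors predicting y from x, and x from y
        ey.x <- resid(lm(y ~ x))
        ex.y <- resid(lm(x ~ y))
        # the two component correlations defined in the paper
        rde.y <- cor(abs(ex.y), abs(y - mean(y)))
        rde.x <- cor(abs(ey.x), abs(x - mean(x)))
        # return all three pieces instead of only the difference
        c(rde.y = rde.y, rde.x = rde.x, D = rde.y - rde.x)
}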
I tried it with some random uniform variables as follows:
> x1 <- runif(50)
> x2 <- runif(50)
> y <- x1+x2
> corr.reg(x1,y)
[1] -0.7029148
> corr.reg(x2,y)
[1] -0.6784307
> corr.reg(x1,x2)
[1] -0.02448412
So far, so good. Under Chambers's logic, a large negative value of D implies that x causes y, a value close to zero implies that neither variable causes the other, and, by inference, a large positive value implies that y causes x.
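One run with random data could of course be a fluke. A quick illustrative check (not something from the original function, just a sketch) would be to repeat the simulation many times and look at the distribution of D when x1 truly causes y:
# repeat the simulation and summarize D when x1 is a genuine cause of y
n.sim <- 200
D <- numeric(n.sim)
for(i in 1:n.sim) {
        x1 <- runif(50)
        x2 <- runif(50)
        y <- x1 + x2
        D[i] <- corr.reg(x1, y)
}
summary(D)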
Then I tried it with a classic data set from Consumer Reports listed at
http://lib.stat.cmu.edu/DASL/Datafiles/Cars.html
> corr.reg(Weight,MPG)
[1] 0.4684389
> corr.reg(Drv.Rat,MPG)
[1] 0.2145594
> corr.reg(HP,MPG)
[1] 0.1312272
> corr.reg(Displ,MPG)
[1] 0.5197843
> corr.reg(Cyl,MPG)
[1] 0.2893959
I think I have the algorithm coded properly, but Dr. Chambers or anyone else could double-check these results very easily.
The interpretation here makes little sense to me. These values of D are all positive, which by the logic above would imply that MPG causes Weight, Displ, and the other predictors. A car's mileage cannot cause its weight or its engine displacement.
A good regression model for this data would use Weight and Drv.Rat as independent variables. This model accounts for 89% of the variation in MPG. I don't see how Corresponding Regressions can either replicate these findings or develop any new insights into this data set.
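For reference, that model could be fit as follows, assuming the DASL variables are available under the same names used in the corr.reg calls above:
# fit the two-predictor model and report the proportion of variation in MPG explained
cars.fit <- lm(MPG ~ Weight + Drv.Rat)
summary(cars.fit)$r.squared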
It could be that this data set is unusual. I'll try this for a few other classic data sets and see what happens.
Steve Simon, [EMAIL PROTECTED], Standard Disclaimer.
The STATS web page has moved to
http://www.childrens-mercy.org/stats.
