Applying Corresponding Regressions across five data sets

Simon, Steve, PhD Wed, 18 Sep 2002 16:18:07 -0700

I've adapted a new statistic, CC, that Dr. Chambers suggested. It is a bit
unusual in that the variables x and y do not participate in a reciprocal
fashion. So you need to compute CC both ways. A large negative value for
cc.y implies that x causes y. A large negative value for cc.x implies that y
causes x. It is theoretically possible that cc.y and cc.x could both be
large negative.


I also got the comment that D is uninterpretable when rde(y) and rde(x) are
both positive.

Here is an explanation from Dr. Chambers.

> D is ambiguous without knowledge of the rde values that 
> generate them.  Two positive rdes do not imply causation. I 
> do not know what they imply but I do not see them in causal 
> simulations.
>  
>  Do present both rde(y) and rde(x).
>  
>  
> A zero rde is expected when the variables are
> uncorrelated independent variables and partitioned by
> either of the X variables (NOT by the Y variable). This
> condition is equal to Thurstone's simple structure. He
> found that this condition holds when the data are
> partitioned by either x variable. See his book Multiple
> Factor Analysis.
>  
> Since my last post I have been working out CC using minitab 
> and will post my results for the auto data tonight and the 
> usual y=x1+x2. 
>  
> So far the equation as I have rewritten/rediscovered it 
> just now is:
>  
> 1. where y=x1+x2 and x1 and x2 are uniform  (This is the 
> causal simulation)
>  
> 2.  Assume we measure only x1 and y on arbitrary scales   
> (This is the data collection simulation)
>  
> 3. Convert x1 and y to zscores :  zx and zy  (This is the 
> data preparation stage)
>  
> 4. Calculate the residual X by: ((square root of
> 2)*zy)-zx1. This produces an estimate of zx2 (r=.99).
> Call it resx. The square root of 2 reproportions the y
> variance, giving a much better residual than would zy-zx1.
> (This works for the current model, but for different levels
> of correlation between zx1 and zy a value other than 2
> might be needed. I do not know.)
>  
> 5. Find Diff = abs(zx1-resx)
>
> 6. Calculate extremity of y by Ext = abs(zy)
>
> 7. Calculate CC by correlating Diff and Ext.
>  
> It should be negative when the hypothesized variables fit the
> x and y roles and the model is additive.
>  
> Try this and see what you think. It may be that rde is just
> as good or identical. I have not calculated rde since my
> GAUSS program is no longer installed. I originally
> programmed D in GAUSS.
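
A quick way to see why the square root of 2 appears in step 4: if x1 and x2
are independent with equal variance, then var(y) = 2*var(x1), so
zy = (zx1+zx2)/sqrt(2) and sqrt(2)*zy-zx1 recovers zx2 exactly in the
population. The sketch below checks this on simulated data. The seed, the
sample size of 1000, and the helper z are arbitrary choices for illustration,
not values from Dr. Chambers' post.

```r
# Assumptions for illustration only: seed, n, and the helper z().
set.seed(1)
n <- 1000
z <- function(v) (v - mean(v)) / sd(v)

x1 <- runif(n)
x2 <- runif(n)
y <- x1 + x2                      # step 1: the causal simulation

zx1 <- z(x1)                      # step 3: z-scores
zy <- z(y)
resx <- sqrt(2) * zy - zx1        # step 4: estimate of zx2

# Correlation between the step 4 residual and the unobserved zx2.
r.check <- cor(resx, z(x2))
round(r.check, 3)
```

With other correlations between zx1 and zy the multiplier would presumably
need to change, which is the caveat raised in step 4.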

These changes complicate things and add subjectivity. So it is possible that
a critic of CR might interpret these results differently than a proponent of
CR.

Also, I may have programmed CC incorrectly. But here goes.

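corr.reg <- function(x,y) {
  # residuals from regressing each variable on the other
  ey.x <- resid(lm(y~x))
  ex.y <- resid(lm(x~y))
  # rde: correlate the absolute residuals with the extremity of the
  # predictor used in that regression
  rde.y <- cor(abs(ex.y),abs(y-mean(y)))
  rde.x <- cor(abs(ey.x),abs(x-mean(x)))
  d <- rde.y-rde.x

  # CC, following steps 3 to 7 above, computed in both directions
  z.x <- (x-mean(x))/sqrt(var(x))
  z.y <- (y-mean(y))/sqrt(var(y))
  res.x <- sqrt(2)*z.y-z.x
  res.y <- sqrt(2)*z.x-z.y
  cc.y <- cor(abs(z.x-res.x),abs(z.y))
  cc.x <- cor(abs(z.y-res.y),abs(z.x))
  # print all five statistics, rounded to two decimals
  cat(paste("\n      D =",round(d,2),
            "\n rde(y) =",round(rde.y,2),
            "\n rde(x) =",round(rde.x,2),
            "\n   cc.y =",round(cc.y,2),
            "\n   cc.x =",round(cc.x,2),"\n\n"))
}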
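
A side observation on the code above, offered as a sketch rather than anything
Dr. Chambers stated: abs(z.x-res.x) equals 2*abs(z.x-z.y/sqrt(2)), while
abs(ex.y) is proportional to abs(z.x-r*z.y), where r is cor(x,y). Since a
correlation ignores positive rescaling, cc.y coincides with rde(y) whenever r
is close to 1/sqrt(2) = 0.707, and using 2*r in place of sqrt(2) makes the two
identical. The helper names z, cc.k and rde.y.only are hypothetical, introduced
only for this check, and the seed is arbitrary.

```r
# Hypothetical helpers for illustration; not part of corr.reg.
z <- function(v) (v - mean(v)) / sd(v)

# CC in the "x causes y" direction with a general multiplier k.
# k = sqrt(2) reproduces cc.y as computed in corr.reg.
cc.k <- function(x, y, k = sqrt(2)) {
  res.x <- k * z(y) - z(x)
  cor(abs(z(x) - res.x), abs(z(y)))
}

# rde(y) computed the same way as in corr.reg.
rde.y.only <- function(x, y) {
  ex.y <- resid(lm(x ~ y))
  cor(abs(ex.y), abs(y - mean(y)))
}

set.seed(1)                       # arbitrary seed
x1 <- runif(50)
x2 <- runif(50)
y <- x1 + x2
r <- cor(x1, y)

# Compare CC with sqrt(2), CC with 2*r, and rde(y).
c(r = r,
  cc.sqrt2 = cc.k(x1, y),
  cc.2r = cc.k(x1, y, k = 2 * r),
  rde.y = rde.y.only(x1, y))
```

This may be why cc and rde agree to two decimals in some of the runs that
follow and differ sharply in others.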

> x1 <- runif(50)
> x2 <- runif(50)
> y <- x1+x2
> 
> corr.reg(x1,y)

      D = -0.72 
 rde(y) = -0.51 
 rde(x) = 0.21 
   cc.y = -0.51 
   cc.x = 0.21 

> corr.reg(x2,y)

      D = -0.79 
 rde(y) = -0.51 
 rde(x) = 0.28 
   cc.y = -0.55 
   cc.x = 0.26 

> corr.reg(x1,x2)

      D = 0.07 
 rde(y) = 0.28 
 rde(x) = 0.21 
   cc.y = 0.27 
   cc.x = 0.37 

Both D and CC behave nicely here.
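
A single draw of 50 points says little about how stable these values are. One
way to look at that is to repeat the simulation many times. The sketch below
makes assumptions that are not in the original post: 200 replications, a fixed
seed, and a hypothetical corr.reg.values that returns the statistics instead of
printing them (the computations are copied from corr.reg).

```r
# Hypothetical variant of corr.reg that returns a named vector.
corr.reg.values <- function(x, y) {
  rde.y <- cor(abs(resid(lm(x ~ y))), abs(y - mean(y)))
  rde.x <- cor(abs(resid(lm(y ~ x))), abs(x - mean(x)))
  z.x <- (x - mean(x)) / sd(x)
  z.y <- (y - mean(y)) / sd(y)
  cc.y <- cor(abs(z.x - (sqrt(2) * z.y - z.x)), abs(z.y))
  cc.x <- cor(abs(z.y - (sqrt(2) * z.x - z.y)), abs(z.x))
  c(D = rde.y - rde.x, rde.y = rde.y, rde.x = rde.x,
    cc.y = cc.y, cc.x = cc.x)
}

set.seed(2002)                    # arbitrary seed
sims <- replicate(200, {          # 200 replications is an arbitrary choice
  x1 <- runif(50)
  x2 <- runif(50)
  corr.reg.values(x1, x1 + x2)
})

# Rows of sims are statistics, columns are replications.
# Show the 5th, 50th and 95th percentiles of each statistic.
round(apply(sims, 1, quantile, probs = c(0.05, 0.5, 0.95)), 2)
```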

> 
> attach(cars)
> corr.reg(Weight,MPG)

      D = 0.47 
 rde(y) = 0.23 
 rde(x) = -0.24 
   cc.y = 0.84 
   cc.x = 0.9 

> corr.reg(Drv.Rat,MPG)

      D = 0.21 
 rde(y) = 0.12 
 rde(x) = -0.1 
   cc.y = 0.1 
   cc.x = -0.2 

> corr.reg(HP,MPG)

      D = 0.13 
 rde(y) = -0.08 
 rde(x) = -0.21 
   cc.y = 0.83 
   cc.x = 0.85 

> corr.reg(Displ,MPG)

      D = 0.52 
 rde(y) = 0.19 
 rde(x) = -0.33 
   cc.y = 0.73 
   cc.x = 0.81 

> corr.reg(Cyl,MPG)

      D = 0.29 
 rde(y) = -0.1 
 rde(x) = -0.39 
   cc.y = 0.72 
   cc.x = 0.7 

We've seen this data before. CC implies no causation, except perhaps that
MPG causes Drive Ratio. The D statistics provide totally counterintuitive
results. In none of the cases can this be attributed to two positive values
for rde.

> 
> # http://lib.stat.cmu.edu/DASL/Datafiles/homedat.html
> 
> detach()
> attach(housing)
> corr.reg(sqft,price)

      D = -0.4 
 rde(y) = 0.13 
 rde(x) = 0.53 
   cc.y = 0.07 
   cc.x = 0.5 

> corr.reg(age[is.finite(age)],price[is.finite(age)])

      D = 0.04 
 rde(y) = 0.13 
 rde(x) = 0.08 
   cc.y = 0.52 
   cc.x = 0.53 

> corr.reg(feats,price)

      D = 0.07 
 rde(y) = 0.15 
 rde(x) = 0.08 
   cc.y = 0.25 
   cc.x = 0.21 

> corr.reg(nec,price)

      D = -0.12 
 rde(y) = -0.3 
 rde(x) = -0.18 
   cc.y = 0.21 
   cc.x = -0.09 

> corr.reg(cust,price)

      D = -0.46 
 rde(y) = -0.04 
 rde(x) = 0.42 
   cc.y = -0.05 
   cc.x = 0.45 

> corr.reg(cor,price)

      D = -0.04 
 rde(y) = -0.21 
 rde(x) = -0.17 
   cc.y = 0.42 
   cc.x = 0.4 

> corr.reg(cust,sqft)

      D = -0.28 
 rde(y) = 0 
 rde(x) = 0.28 
   cc.y = 0.04 
   cc.x = 0.3 

The D values behave nicely here. The size of the house and whether it is
custom built cause the price. The CC values say nothing at all, unless you
want to trust a value like -0.05 or -0.09.

> 
> # http://lib.stat.cmu.edu/DASL/Datafiles/cigcancerdat.html
> 
> detach()
> attach(cigarettes)
> corr.reg(cig,bladder)

      D = 0.53 
 rde(y) = 0.3 
 rde(x) = -0.22 
   cc.y = 0.3 
   cc.x = -0.22 

> corr.reg(cig,lung)

      D = 0.26 
 rde(y) = 0.24 
 rde(x) = -0.01 
   cc.y = 0.25 
   cc.x = -0.01 

> corr.reg(cig,kidney)

      D = -0.27 
 rde(y) = 0.01 
 rde(x) = 0.28 
   cc.y = 0.09 
   cc.x = 0.36 

> corr.reg(cig,leuk)

      D = -0.06 
 rde(y) = -0.06 
 rde(x) = 0 
   cc.y = 0.48 
   cc.x = 0.32 

This is about as bad as the car mileage data. Both the CC and D values imply
that bladder cancer causes cigarette smoking. The D value implies that lung
cancer causes smoking. For kidney cancer, the D statistic gets it right, but
we need to throw out this result because D is unreliable when both rde
values are positive.
 
> 
> # http://lib.stat.cmu.edu/DASL/Datafiles/Ageandheight.html
> 
> child.ht <- c(76.1, 77, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8,
+               82.8, 83.5)
> child.age <- 18:29
> corr.reg(child.age,child.ht)

      D = -0.16 
 rde(y) = -0.26 
 rde(x) = -0.11 
   cc.y = 0.76 
   cc.x = 0.83 

The D value is good here, as age causes height. The CC value is ambiguous.

> 
> # http://lib.stat.cmu.edu/DASL/Datafiles/SmokingandCancer.html
> 
> detach()
> attach(smoking)
> corr.reg(Smoking,Mortality)

      D = -0.26 
 rde(y) = -0.23 
 rde(x) = 0.04 
   cc.y = -0.22 
   cc.x = 0.04 

> detach()
>

Good! The D value and the CC value both imply that smoking causes mortality.

Everyone will have a different conclusion, perhaps, but my conclusion is
that the currently defined statistics of D and CC are not helpful in
identifying causes across a wide variety of data sets. D appears to behave a
bit better than CC, but still provides highly counter-intuitive findings in
two of the five data sets.

Feel free to offer your own interpretations. And if there is an error in my
code, please let me know.

Steve Simon, [EMAIL PROTECTED], Standard Disclaimer.
The STATS web page has moved to
http://www.childrens-mercy.org/stats.

