Steve,

Interesting and challenging data.  I am puzzled, however, as to why the CC
and rde statistics are the same or nearly the same in the simulations but
depart markedly in the real data. Any suggestions as to why?
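
One conjecture worth checking (my own sketch, not part of CR or CC as defined): with the fixed sqrt(2) factor, zx - resx works out to 2*zx - sqrt(2)*zy, which is proportional to the regression residual zx - r*zy exactly when r = cor(x,y) = 1/sqrt(2), which is roughly the correlation the y=x1+x2 simulation produces by construction. In that case cc.y and rde(y) correlate proportional quantities and should nearly agree, while real pairs with r far from .71 (the MPG data even has strong negative correlations) would pull them apart. A small R check:

```r
# Sketch of the conjecture: cc.y nearly equals rde(y) when
# cor(x, y) is close to 1/sqrt(2), as in the y = x1 + x2 simulation.
set.seed(1)
x1 <- runif(1000)
x2 <- runif(1000)
y  <- x1 + x2                        # r(x1, y) is about 1/sqrt(2)

# rde(y) as in Steve's corr.reg function
rde.y <- cor(abs(resid(lm(x1 ~ y))), abs(y - mean(y)))

# cc.y as in Steve's corr.reg function, with the fixed sqrt(2) factor
z.x   <- (x1 - mean(x1)) / sd(x1)
z.y   <- (y  - mean(y))  / sd(y)
res.x <- sqrt(2) * z.y - z.x
cc.y  <- cor(abs(z.x - res.x), abs(z.y))

cor(x1, y)      # close to 0.707
cc.y - rde.y    # close to zero: the two statistics coincide here
```

If this is right, the agreement in the simulations is an artifact of the simulation's correlation structure rather than evidence that the two statistics measure the same thing.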

As I mentioned, the CC method requires the calculation of sums and
differences between two variables. If the original causation was based on a
subtractive rather than an additive model, then instead of calculating the
absolute differences we should calculate the absolute sums.  Another issue
is the number by which to adjust y (the square root of 2).  Did you look at
how well resx correlated with x2 in the simulations?  Would it do this for
the model y=x1+x2+x3, pooling x2+x3?
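
To make these two questions concrete, here is a sketch one could run (my own; the generalization of the adjustment factor from sqrt(2) to 1/r, where r = cor(x1,y), is my assumption, not an established part of the method, though it reduces to sqrt(2) when r = 1/sqrt(2)). It checks (a) how well resx recovers x2 under y=x1+x2, and (b) what resx recovers under y=x1+x2+x3, where it can at best track the pooled x2+x3:

```r
set.seed(2)
n  <- 1000
x1 <- runif(n)
x2 <- runif(n)
x3 <- runif(n)

# Residual with the adjustment factor generalized from sqrt(2) to 1/r.
# This is an assumed generalization of Chambers' step 4.
res.general <- function(x, y) {
  z.x <- (x - mean(x)) / sd(x)
  z.y <- (y - mean(y)) / sd(y)
  z.y / cor(x, y) - z.x
}

# (a) Two-variable model: resx should track x2 almost perfectly.
y2 <- x1 + x2
cor(res.general(x1, y2), x2)        # close to 1

# (b) Three-variable model: resx tracks the pooled x2 + x3,
# and only imperfectly tracks either component alone.
y3 <- x1 + x2 + x3
cor(res.general(x1, y3), x2 + x3)   # close to 1 (pooled)
cor(res.general(x1, y3), x2)        # noticeably lower (single component)
```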

What variables could be confounded with bladder cancer?  CR assumes the
dependent variable is continuous. Was cancer a rate per 100,000 or
something like that, or a yes/no variable?  How highly correlated is
bladder cancer with alcohol consumption?

I suspect that the reason CC and rde differ in the real data is that some
of the real data is based on subtractive models and the residual is not
accurate. Or something in the calculations is off. These are the only
things I can think it could be.

But let's assume that your final conclusion is that CC or rde or whatever
does not reflect causes. Then tell us what you think causes are, if not
combinations of variables. If causes are combinations, then why have we
gotten this mixed result?  Why shouldn't it work?

Would you expect CR to work if we had used data from physics or
engineering, as Gottfried suggested might be the case?  If it would work
for physics, why not for other disciplines?

You see, I have run into many problems that did not work for trivial
reasons and that were later solved by sticking to the logic.  Where is the
error in my logic?

Thanks for the honor you have given me of testing my ideas.  You are a good
man and a good scientist. Now let's get beyond the results section of the
paper and write a discussion that is theory based.  What do the results say
about the constructs?

I will not be presenting the data using Minitab tonight. I am tired, and we
have enough results to question the validity of CR for now.  I also have a
class on servers tomorrow and need some rest.  Good night, colleagues of
science.  May the axe fall as it should.

Best,

Bill Chambers


----- Original Message -----
From: "Simon, Steve, PhD" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Wednesday, September 18, 2002 4:37 PM
Subject: Applying Corresponding Regressions across five data sets


> I've adapted a new statistic, CC, that Dr. Chambers suggested. It is a bit
> unusual in that the variables x and y do not participate in a reciprocal
> fashion, so you need to compute CC both ways. A large negative value for
> cc.y implies that x causes y. A large negative value for cc.x implies that y
> causes x. It is theoretically possible that cc.y and cc.x could both be
> large negative.
>
> I also got the comment that D is uninterpretable when rde(y) and rde(x) are
> both positive.
>
> Here is an explanation from Dr. Chambers.
>
> > D is ambiguous without knowledge of the rde values that
> > generate it.  Two positive rdes do not imply causation. I
> > do not know what they imply, but I do not see them in causal
> > simulations.
> >
> > Do present both rde(y) and rde(x).
> >
> > A zero rde is expected when the variables are
> > uncorrelated independent variables and the data are
> > partitioned by either of the X variables (NOT by the Y
> > variable). This condition is equivalent to Thurstone's
> > simple structure. He found that this condition holds when
> > the data are partitioned by either x variable. See his book
> > Multiple Factor Analysis.
> >
> > Since my last post I have been working out CC using Minitab
> > and will post my results tonight for the auto data and the
> > usual y=x1+x2.
> >
> > So far the equation, as I have rewritten/rediscovered it
> > just now, is:
> >
> > 1. Let y=x1+x2, where x1 and x2 are uniform. (This is the
> > causal simulation.)
> >
> > 2. Assume we measure only x1 and y on arbitrary scales.
> > (This is the data collection simulation.)
> >
> > 3. Convert x1 and y to z-scores: zx and zy. (This is the
> > data preparation stage.)
> >
> > 4. Calculate the residual X by ((square root of
> > 2)*zy)-zx1.  This produces an estimate of zx2 (r=.99).
> > Call it resx. The square root of 2 reproportions the y
> > variance, giving a much better residual than would zy-zx1.
> > This works for the current model, but for different levels
> > of correlation between zx1 and zy a value other than 2
> > might be needed .... I do not know.
> >
> > 5. Find Diff = abs(zx1-resx).
> >
> > 6. Calculate the extremity of y by Ext = abs(zy).
> >
> > 7. Calculate CC by correlating Diff and Ext.
> >
> > It should be negative when the hypothesized variables fit the
> > x and y roles and the model is additive.
> >
> > Try this and see what you think. It may be that rde is just
> > as good or identical. I have not calculated rde since my
> > GAUSS program is no longer installed. I originally
> > programmed D in GAUSS.
>
> These changes complicate things and add subjectivity. So it is possible that
> a critic of CR might interpret these results differently than a proponent of
> CR.
>
> Also, I may have programmed CC incorrectly. But here goes.
>
> corr.reg <- function(x,y) {
>   # rde: correlate the absolute cross-residuals with extremity
>   ey.x <- resid(lm(y~x))
>   ex.y <- resid(lm(x~y))
>   rde.y <- cor(abs(ex.y),abs(y-mean(y)))
>   rde.x <- cor(abs(ey.x),abs(x-mean(x)))
>   d <- rde.y-rde.x
>
>   # CC: Dr. Chambers' steps 3-7, with the fixed sqrt(2) factor
>   z.x <- (x-mean(x))/sqrt(var(x))
>   z.y <- (y-mean(y))/sqrt(var(y))
>   res.x <- sqrt(2)*z.y-z.x
>   res.y <- sqrt(2)*z.x-z.y
>   cc.y <- cor(abs(z.x-res.x),abs(z.y))
>   cc.x <- cor(abs(z.y-res.y),abs(z.x))
>   cat(paste("\n      D =",round(d,2),
>             "\n rde(y) =",round(rde.y,2),
>             "\n rde(x) =",round(rde.x,2),
>             "\n   cc.y =",round(cc.y,2),
>             "\n   cc.x =",round(cc.x,2),"\n\n"))
> }
>
> > x1 <- runif(50)
> > x2 <- runif(50)
> > y <- x1+x2
> >
> > corr.reg(x1,y)
>
>       D = -0.72
>  rde(y) = -0.51
>  rde(x) = 0.21
>    cc.y = -0.51
>    cc.x = 0.21
>
> > corr.reg(x2,y)
>
>       D = -0.79
>  rde(y) = -0.51
>  rde(x) = 0.28
>    cc.y = -0.55
>    cc.x = 0.26
>
> > corr.reg(x1,x2)
>
>       D = 0.07
>  rde(y) = 0.28
>  rde(x) = 0.21
>    cc.y = 0.27
>    cc.x = 0.37
>
> Both D and CC behave nicely here.
>
> >
> > attach(cars)
> > corr.reg(Weight,MPG)
>
>       D = 0.47
>  rde(y) = 0.23
>  rde(x) = -0.24
>    cc.y = 0.84
>    cc.x = 0.9
>
> > corr.reg(Drv.Rat,MPG)
>
>       D = 0.21
>  rde(y) = 0.12
>  rde(x) = -0.1
>    cc.y = 0.1
>    cc.x = -0.2
>
> > corr.reg(HP,MPG)
>
>       D = 0.13
>  rde(y) = -0.08
>  rde(x) = -0.21
>    cc.y = 0.83
>    cc.x = 0.85
>
> > corr.reg(Displ,MPG)
>
>       D = 0.52
>  rde(y) = 0.19
>  rde(x) = -0.33
>    cc.y = 0.73
>    cc.x = 0.81
>
> > corr.reg(Cyl,MPG)
>
>       D = 0.29
>  rde(y) = -0.1
>  rde(x) = -0.39
>    cc.y = 0.72
>    cc.x = 0.7
>
> We've seen this data before. CC implies no causation, except perhaps that
> MPG causes Drive Ratio. The D statistics provide totally counterintuitive
> results. In none of the cases can this be attributed to two positive values
> for rde.
>
> >
> > # http://lib.stat.cmu.edu/DASL/Datafiles/homedat.html
> >
> > detach()
> > attach(housing)
> > corr.reg(sqft,price)
>
>       D = -0.4
>  rde(y) = 0.13
>  rde(x) = 0.53
>    cc.y = 0.07
>    cc.x = 0.5
>
> > corr.reg(age[is.finite(age)],price[is.finite(age)])
>
>       D = 0.04
>  rde(y) = 0.13
>  rde(x) = 0.08
>    cc.y = 0.52
>    cc.x = 0.53
>
> > corr.reg(feats,price)
>
>       D = 0.07
>  rde(y) = 0.15
>  rde(x) = 0.08
>    cc.y = 0.25
>    cc.x = 0.21
>
> > corr.reg(nec,price)
>
>       D = -0.12
>  rde(y) = -0.3
>  rde(x) = -0.18
>    cc.y = 0.21
>    cc.x = -0.09
>
> > corr.reg(cust,price)
>
>       D = -0.46
>  rde(y) = -0.04
>  rde(x) = 0.42
>    cc.y = -0.05
>    cc.x = 0.45
>
> > corr.reg(cor,price)
>
>       D = -0.04
>  rde(y) = -0.21
>  rde(x) = -0.17
>    cc.y = 0.42
>    cc.x = 0.4
>
> > corr.reg(cust,sqft)
>
>       D = -0.28
>  rde(y) = 0
>  rde(x) = 0.28
>    cc.y = 0.04
>    cc.x = 0.3
>
> The D values behave nicely here. The size of the house and whether it is
> custom built cause the price. The CC values say nothing at all, unless you
> want to trust a value like -0.05 or -0.09.
>
> >
> > # http://lib.stat.cmu.edu/DASL/Datafiles/cigcancerdat.html
> >
> > detach()
> > attach(cigarettes)
> > corr.reg(cig,bladder)
>
>       D = 0.53
>  rde(y) = 0.3
>  rde(x) = -0.22
>    cc.y = 0.3
>    cc.x = -0.22
>
> > corr.reg(cig,lung)
>
>       D = 0.26
>  rde(y) = 0.24
>  rde(x) = -0.01
>    cc.y = 0.25
>    cc.x = -0.01
>
> > corr.reg(cig,kidney)
>
>       D = -0.27
>  rde(y) = 0.01
>  rde(x) = 0.28
>    cc.y = 0.09
>    cc.x = 0.36
>
> > corr.reg(cig,leuk)
>
>       D = -0.06
>  rde(y) = -0.06
>  rde(x) = 0
>    cc.y = 0.48
>    cc.x = 0.32
>
> This is about as bad as the car mileage data. Both the CC and D values imply
> that bladder cancer causes cigarette smoking. The D value implies that lung
> cancer causes smoking. For kidney cancer, the D statistic gets it right, but
> we need to throw out this result because D is unreliable when both rde
> values are positive.
>
> >
> > # http://lib.stat.cmu.edu/DASL/Datafiles/Ageandheight.html
> >
> > child.ht <- c(76.1, 77, 78.1, 78.2, 78.8, 79.7, 79.9, 81.1, 81.2, 81.8, 82.8, 83.5)
> > child.age <- 18:29
> > corr.reg(child.age,child.ht)
>
>       D = -0.16
>  rde(y) = -0.26
>  rde(x) = -0.11
>    cc.y = 0.76
>    cc.x = 0.83
>
> The D value is good here, as age causes height. The CC value is ambiguous.
>
> >
> > # http://lib.stat.cmu.edu/DASL/Datafiles/SmokingandCancer.html
> >
> > detach()
> > attach(smoking)
> > corr.reg(Smoking,Mortality)
>
>       D = -0.26
>  rde(y) = -0.23
>  rde(x) = 0.04
>    cc.y = -0.22
>    cc.x = 0.04
>
> > detach()
> >
>
> Good! The D value and the CC value both imply that smoking causes mortality.
>
> Everyone will have a different conclusion, perhaps, but my conclusion is
> that the currently defined statistics of D and CC are not helpful in
> identifying causes across a wide variety of data sets. D appears to behave a
> bit better than CC, but still provides highly counterintuitive findings in
> two of the five data sets.
>
> Feel free to offer your own interpretations. And if there is an error in my
> code, please let me know.
>
> Steve Simon, [EMAIL PROTECTED], Standard Disclaimer.
> The STATS web page has moved to
> http://www.childrens-mercy.org/stats.
>
>



.
.
=================================================================
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at:
.                  http://jse.stat.ncsu.edu/                    .
=================================================================
