We are making quite a bit of progress. We are finding more and more evidence
that CR does not work with real world data sets. The following types of
data, according to Dr. Chambers, are not good for CR.

1. Convenience samples. That pretty much eliminates the ability of CR to
work in any study that requires informed consent, because restricting your
sample to only those who give informed consent makes your sample a
convenience sample. It eliminates data sets where the study is restricted to
a single hospital rather than a representative sample of hospitals. It
eliminates data sets where we need financial inducements to get people to
participate.

2. Data sets with confounders. That pretty much eliminates any epidemiology
data set. And we'll never be able to use CR to understand the environmental
and hereditary causes of cancer, because cancer data has too many
confounders.

3. Data sets where the correlation is too high. This eliminates a lot of the
physical sciences. I know that chemists get disappointed if the correlation
in their calibration experiments is not at least 0.98.

Now I wonder how you would design an experiment to keep the correlation from
being too high? I suppose you could deliberately be sloppy and hope that
this introduces extra error into the process.

In medical applications, we are totally without hope. Birth weight and
gestational age are highly correlated, and we can do nothing to remove this
correlation. We can't command mothers to have 4000 gram/26 week babies or
500 gram/38 week babies. It just isn't going to happen.
 
Most statisticians are delighted to get a very high correlation. The
strength of the correlation is one of the nine conditions that Hill set out
in 1965 to establish a cause and effect relationship.

4. You have to have enough data at the extremes. We might be able to fix
this if we trim the data, but this has been shown to work only in
simulations.

And you have to be careful what you remove. If you remove the data by
trimming the edges, that works, according to Dr. Chambers. But if you remove
the data by creating evenly spaced bins on a rectangular grid and then
selecting the first observation to fall in each bin, then that makes CR
worse, according to Dr. Chambers.

I have very little faith in the trimming approach. Selectively removing data
values because they are extreme is asking for trouble. It will create all
sorts of artefactual problems. And it will do nothing to fix all the other
problems listed in this email.

5. If you have two possible causes, you need to sample so that all four
corners of the square are filled. If you have three possible causes, you
need to sample so that all eight corners of the cube are filled.

6. You have to have a linear relationship. Transforming the data so that the
relationship becomes linear may or may not work.
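
The two ways of removing data described under item 4 can be sketched roughly
as follows. This is only an illustration on made-up data: the function names
(trim.edges, thin.grid), the 5% cutoff, the 5 by 5 grid, and the choice to
trim on both variables at once are assumptions, not anything specified by
Dr. Chambers.

```r
# Hypothetical data; variable names and cutoffs are illustrative only.
set.seed(1)
x <- rnorm(200)
y <- 0.5 * x + rnorm(200)

# Approach 1: trim the edges, keeping only points that lie inside the
# central 90% of both x and y.
trim.edges <- function(x, y, p = 0.05) {
  keep <- x >= quantile(x, p, names = FALSE) &
          x <= quantile(x, 1 - p, names = FALSE) &
          y >= quantile(y, p, names = FALSE) &
          y <= quantile(y, 1 - p, names = FALSE)
  data.frame(x = x[keep], y = y[keep])
}

# Approach 2: lay an evenly spaced rectangular grid over the data and
# keep the first observation that falls in each occupied cell.
thin.grid <- function(x, y, n.bins = 5) {
  cell <- paste(cut(x, n.bins), cut(y, n.bins))
  keep <- !duplicated(cell)
  data.frame(x = x[keep], y = y[keep])
}

trimmed <- trim.edges(x, y)
thinned <- thin.grid(x, y)
c(original = length(x), trimmed = nrow(trimmed), thinned = nrow(thinned))
```

Trimming drops the extremes and leaves the dense middle alone. Grid
selection keeps at most one point per cell, so it flattens the density
instead. The two methods therefore leave very different data behind.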

There are other restrictions that I believe should be added to the list.

7. Heteroscedasticity will almost certainly ruin CR.

8. CR does not work well with a discrete cause variable. It fails miserably
when the cause variable is binary. I suspect it will perform poorly when the
cause variable only has three or four levels.

9. We can also rule out any data set where the effect is binary. We can
never use CR to establish causes of mortality for example, because there is
no middle category for mortality.
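
Some of these restrictions are mechanical enough to check before running CR
at all. A rough sketch for items 5, 8, and 9 follows. The helper names
(n.levels, corners.filled) are hypothetical, and defining a "corner" by the
lower and upper quartiles is an assumption made only for illustration.

```r
# Hypothetical screening helpers; names and thresholds are illustrative.

# Items 8 and 9: count the distinct values of a cause or effect variable.
n.levels <- function(v) length(unique(v[!is.na(v)]))

# Item 5: with two possible causes, is there at least one observation in
# each corner of the square? A corner is taken here to mean low/low,
# low/high, high/low, or high/high relative to the quartiles.
corners.filled <- function(x1, x2, p = 0.25) {
  lo1 <- x1 <= quantile(x1, p, names = FALSE)
  hi1 <- x1 >= quantile(x1, 1 - p, names = FALSE)
  lo2 <- x2 <= quantile(x2, p, names = FALSE)
  hi2 <- x2 >= quantile(x2, 1 - p, names = FALSE)
  c(low.low  = any(lo1 & lo2), low.high  = any(lo1 & hi2),
    high.low = any(hi1 & lo2), high.high = any(hi1 & hi2))
}

# Two strongly correlated causes, in the spirit of birth weight and
# gestational age: the off-diagonal corners cannot be filled.
set.seed(2)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.1)
corners.filled(x1, x2)

# A binary effect such as mortality has only two levels.
n.levels(c(0, 1, 1, 0))
```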

As additional real data sets fail, I'm sure we will see additional reasons
added to this list. It may come down to this. CR only works for a data set
that is carefully and meticulously designed from scratch to meet the
extremely rigorous demands of the method. For simplicity, I'll call this a
C-sample.

To prove or disprove that CR works with a C-sample, we would have to collect
some data from scratch. That's too expensive and time consuming for me to
do. But by showing how few real world data sets we can apply CR to, I will
be performing some service.

And while I am still skeptical of simulations, Dr. Chambers's comment that
CR works better for moderate than for strong correlations is indeed
supported by a simple simulation.

Take the existing data set, fit a linear regression of resp on dose, and
compute the residuals and predicted values. Recombine the residual and the
predicted value after multiplying the residual by a factor of 10, 30, or
100. We get a data set that is similar to the original data set, but with
much more error in the data. It appears that a correlation of 0.22 works
better with CR than a correlation around 0.56 or 0.07.

> fit <- lm(resp~dose)   # fit once, reuse for fitted values and residuals
> pred.resp <- predict(fit)
> resid.resp <- resid(fit)
> resp1 <- pred.resp+10*resid.resp
> resp2 <- pred.resp+30*resid.resp
> resp3 <- pred.resp+100*resid.resp
> cor(dose,cbind(resp,resp1,resp2,resp3))
     resp     resp1     resp2      resp3 
 0.989066 0.5570023 0.2181726 0.06691709
>
> corr.reg(dose,resp)

      D = 0.16 
 rde(y) = 0.06 
 rde(x) = -0.09 
   cc.y = 0.78 
   cc.x = 0.52 

> corr.reg(dose,resp1)

      D = -0.05 
 rde(y) = -0.14 
 rde(x) = -0.09 
   cc.y = 0 
   cc.x = -0.02 

> corr.reg(dose,resp2)

      D = -0.23 
 rde(y) = -0.32 
 rde(x) = -0.09 
   cc.y = 0.14 
   cc.x = 0.25 

> corr.reg(dose,resp3)

      D = -0.08 
 rde(y) = -0.17 
 rde(x) = -0.09 
   cc.y = 0.21 
   cc.x = 0.35 
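
The correlations of the reweighted data sets follow directly from the
construction. Least-squares residuals are uncorrelated with dose, so
multiplying them by k changes the correlation from r to
r / sqrt(r^2 + k^2 * (1 - r^2)). A small sketch, where inflated.cor and the
simulated dose.sim and resp.sim are hypothetical names and made-up data, not
the data set used above:

```r
# Correlation expected after multiplying least-squares residuals by k,
# starting from an original correlation r.
inflated.cor <- function(r, k) r / sqrt(r^2 + k^2 * (1 - r^2))

# Starting from the original correlation reported in the session above
round(inflated.cor(0.989066, c(10, 30, 100)), 4)

# Check the formula against a direct calculation on simulated data
set.seed(3)
dose.sim <- runif(50, 0, 10)
resp.sim <- 2 + 3 * dose.sim + rnorm(50)
fit <- lm(resp.sim ~ dose.sim)
resp.sim10 <- predict(fit) + 10 * resid(fit)
c(direct  = cor(dose.sim, resp.sim10),
  formula = inflated.cor(cor(dose.sim, resp.sim), 10))
```

With r = 0.989066 the formula gives roughly 0.557, 0.218, and 0.067, which
agrees with the cor() output above. The reweighted data sets differ from the
original only in how much residual noise they carry.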

So where do we stand? I've pretty much decided that CR is useless for the
data sets that I encounter in my job. There may be some real data sets out
there where CR works, but I'm starting to lose hope. Trimming is a waste of
time, in my opinion. If it works at all, it only overcomes one of the many
limitations of CR. Trimming a data set won't remove any hidden confounders,
for example.

Testing CR on a C-sample might be worthwhile, but we'd have to think of an
experiment we could design that wouldn't cost a lot of money or take a lot
of time.

Steve Simon, [EMAIL PROTECTED], Standard Disclaimer.
The STATS web page has moved to
http://www.childrens-mercy.org/stats