In the mode of addressing the relevance of the trope "correlation doesn't
imply causation" to the issue of governments testing social theories on
unwilling human subjects:
About 15 years ago, I was working at a well-funded Silicon Valley startup
in Palo Alto with about 100 people. During the few years I was there, 5
children of parents working there were diagnosed on the autism spectrum. I
contacted the Berkeley epidemiologist who had been studying autism and
informed him of the anomaly. His response was simply, "We know such
clusters exist in Silicon Valley and we don't know what causes them."
Well, DUH!
I was outraged.
Several years later I was able to get data out of the Dept of Education on
the incidence of autism by State. I could not locate data by county. So I
did what any _reasonable epidemiologist should_ do with such data: I
surveyed the list of current hypotheses of causes of autism, added a few,
and gathered State-level data on other variables related to those
hypotheses to look for *gasp* CORRELATIONS.
Now, none of this would be in the _least_ controversial, except that one of
the hypotheses was that the recent increase in immigration from India to
places like Silicon Valley was bringing in a pathogen -- possibly
intestinal -- being spread in some manner such as restaurants. Moreover,
the project wouldn't have been controversial even then because if you look
at the rank-order of single-variable correlations, the correlation with
immigrants from India doesn't beat mother's age at first live birth
(MAAFLB; one hypothesis is father's age producing errors in the sperm's
DNA -- for which
MAAFLB is a proxy). However, if we're looking at a population with high
susceptibility -- say genetic background from human ecologies with low
population densities -- then you have to construct a composite variable as
the conjunction between the susceptible population and the vector
population.
Lo and behold, when all 2-variable conjunctions were correlated with autism
incidence, the pair that came out on top was immigrants from India per
capita and Finnish ancestry per capita.
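For anyone who wants to try this kind of pair ranking themselves, here is
a minimal sketch. It assumes the "conjunction" of two per-capita variables
is simply their product, which is only one reasonable reading of the
description above. The variable names and numbers are made-up toy values,
not the real State-level data, and `rank_pair_conjunctions` is a
hypothetical helper name.

```python
from itertools import combinations
from math import sqrt


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def rank_pair_conjunctions(variables, outcome):
    """Rank every 2-variable product by |r| against the outcome.

    ASSUMPTION: the composite ("conjunction") variable is the element-wise
    product of the two per-capita variables.
    """
    ranked = []
    for (name_a, a), (name_b, b) in combinations(variables.items(), 2):
        composite = [x * y for x, y in zip(a, b)]
        ranked.append((abs(pearson_r(composite, outcome)), name_a, name_b))
    return sorted(ranked, reverse=True)


# Toy numbers for five hypothetical states; NOT real data.
variables = {
    "var_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "var_b": [2.0, 1.0, 4.0, 3.0, 5.0],
    "var_c": [5.0, 3.0, 4.0, 1.0, 2.0],
}
# Outcome deliberately built as var_a * var_b so the ranking has a known winner.
outcome = [2.0, 2.0, 12.0, 12.0, 25.0]

ranking = rank_pair_conjunctions(variables, outcome)
top = ranking[0]
for r, name_a, name_b in ranking:
    print(f"{name_a} x {name_b}: |r| = {r:.3f}")
```

Note that the number of pairs grows as n(n-1)/2, so a few hundred
variables already means tens of thousands of candidate pairs.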
NOW we're in serious trouble for obvious political reasons!
So I added hundreds more demographic variables to see if, by chance, I
could get some pairs of variables to beat that pair -- not that this would,
by itself, invalidate the correlation. Such scatter-shot searches for
correlations are notorious as a statistical fallacy called data-mining:
you have no idea what class of correlations you're looking for and, just
by pure chance, you can expect to find some that rank higher. So you can't
automatically conclude they are significant, even though the Pearson's 'r'
and degrees of freedom (set by the sample size) -- taken out of the
data-mining context -- might indicate high significance. What I found was
that, indeed, there were higher-correlation pairs, but in the scatter plot
for the correlation in question, some data points looked like statistical
outliers. This is a common problem in science and
it can result from a large number of things -- but usually some kind of
measurement error. It is standard procedure, in such scenarios, to throw
out the top and bottom measurements -- thereby reducing the sample size but
hopefully ending up with a higher-quality sample. Doing that, the India
immigrant x Finnish ancestry pair once again topped the list, which now
included a combinatorial explosion of pairs.
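The data-mining caveat above is easy to see in a small simulation. This is
only a sketch under stated assumptions: the outcome and every candidate
variable are pure Gaussian noise, 50 observations stand in for roughly one
per State, and `running_best_r` is a hypothetical helper name. Nothing
here uses the actual data.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def running_best_r(n_obs, n_vars, seed=0):
    """Best |r| seen so far as more unrelated noise variables are tried.

    ASSUMPTION: outcome and candidates are independent standard normal
    draws, so any correlation found is due to chance alone.
    """
    rng = random.Random(seed)
    outcome = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
    best, running = 0.0, []
    for _ in range(n_vars):
        noise = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
        best = max(best, abs(pearson_r(noise, outcome)))
        running.append(best)
    return running


running = running_best_r(n_obs=50, n_vars=2000)
for k in (1, 10, 100, 2000):
    print(f"best |r| among first {k:>4} noise variables: {running[k - 1]:.2f}")
```

The best chance correlation can only go up as more candidates are tried,
which is why a top-ranked 'r' from a large search can't be read against
the usual single-test significance tables.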
So we're still far from out of the scientific woods (let alone political
woods) with this, since the single-variable correlation with
mother's age at first live birth is nipping at the heels of the politically
volatile correlation. Moreover, the MAAFLB scatter plot is more 'normal'
or 'robust', meaning that the data points spread out relatively evenly
around the regression line, whereas the politically volatile correlation is
ragged -- far from 'normal'. You can try to discount the ragged
correlation scatter and keep the high rank for the politically volatile
correlation by invoking confounding variables such as differing standards
of autism diagnosis applied across different states, etc. However, the
fact remains that the MAAFLB correlation is less complicated (single
variable) and is more robust.
OK, so where does this leave us?
Well, if I were forced to choose one hypothesis as a working hypothesis I'd
say father's age is the correct hypothesis -- not because it avoids the
nasty politics of immigration -- but simply on standard statistical merits.
However, life isn't so kind to us as to allow us to ignore all alternative
hypotheses -- even when those hypotheses might be considered Hate Data.
This is particularly true when you have something as devastating to
families, already struggling with the disappearance of middle class jobs,
as autism mysteriously exploding in incidence.
But it gets worse:
Once I had this database of hundreds of by-State demographic variables, I
decided to -- just