Lauren wrote:
> 
> Hello,
> I am interested in calculating the correlation coefficient between two
> variables, both continuous quantitative variables between 0 and 1.
> However, I noticed that the dependent variable is nearly binary with a
> large amount of its values at 0 and 1 with only a small fraction
> in-between.  

        In which case it's not a continuous distribution; it's a mixed
distribution.

        It strongly suggests that something is unusual in the process by
which the data are generated. Some possibilities are:

        (a) two variables, a Bernoulli one and a continuous one, are being mixed
in some proportion. E.g., a working logic circuit will always be measured at
0V or 5V, while a burned-out one may take any value in between.

        (b) a "clamping" process in which the variable is unable to go beyond a
certain range but "would like to". Example: transverse position of a
bowling ball at the end of the lane in a beginners' game. A high
proportion will be in one gutter or the other, the rest uniformly
distributed across the lane.

        (c) some combination: e.g., a Grade 3 spelling test administered in
English to a group of adults, some of whom know no English.

<flame level="mild"> 
        PLEASE, people, when you post a question about a data set to EDSTAT-L,
PLEASE explain as much as you know about the source.  Posting a sketchy
description is like going to the doctor and saying that you have a
friend who thinks she might be pregnant...
        We're not going to steal your data; and if it's really so confidential
that you may not discuss it, then (in my non-lawyer opinion) you ought
not to be saying anything about it at all without a nondisclosure
agreement made binding by a contract involving the transfer of at least
a token fee. 
         Not to me, please; I'm too busy to take on consultations. Little matter
of a hurricane here last week...
</flame>

        I would suggest that while you *can* compute the correlation
coefficient, its definition rests on the idea that deviation is best
measured by the sum of squared differences from the mean.  This was
*not* written in small print on the back of the Two Tablets that Moses
brought down the mountain; in normal use it comes from the idea of
likelihood, and from the fact that for independent homoskedastic normal
errors, the log-likelihood is (up to sign and an additive constant)
proportional to the sum of squared errors.
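
        To spell that algebra out (a standard textbook step, nothing specific
to your data): if the errors e_i = y_i - \mu_i are independent N(0, \sigma^2),
the log-likelihood is

    \ell(\mu, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2)
                          - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu_i)^2 ,

so for fixed \sigma, maximizing the likelihood over the \mu_i means minimizing
the sum of squared deviations.  Take away the normal-error assumption and that
justification goes with it.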

        In a model such as you describe, the normal error model is obviously
not applicable (not even approximately), so neither is the correlation
coefficient.  In particular, the significance level computed for r^2, for
whatever number of data points you have, is certainly bogus.
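
        If you want to see for yourself how much (or how little) that nominal
significance level is worth, one assumption-light check is to compare it with
a permutation p-value, which needs nothing beyond exchangeability under the
null.  A minimal sketch in Python, assuming your two columns are already in
numpy arrays x and y (those names, the 10,000 permutations, and the seed are
just illustrative choices of mine):

    import numpy as np
    from scipy.stats import pearsonr

    def permutation_pvalue(x, y, n_perm=10_000, seed=0):
        # Two-sided permutation p-value for the Pearson correlation:
        # shuffle y, recompute r, and count how often |r| is at least as
        # large as the observed value.  No normality assumed.
        rng = np.random.default_rng(seed)
        r_obs, _ = pearsonr(x, y)
        r_perm = np.array([pearsonr(x, rng.permutation(y))[0]
                           for _ in range(n_perm)])
        return np.mean(np.abs(r_perm) >= abs(r_obs))

    r_obs, p_normal = pearsonr(x, y)      # p-value from normal theory
    p_perm = permutation_pvalue(x, y)     # p-value from permutations alone
    print(r_obs, p_normal, p_perm)

If the two p-values disagree badly, you have your answer about the normal
theory.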

        So, what do I recommend?   You may need to do two or three analyses:
perhaps a first one determining which measurements are Bernoulli and which
are continuous; then one modelling the Bernoulli outcome; then one on the
continuous data (which might involve correlation or might not).  I can't
tell you exactly what to do (a rough sketch of what such a split might look
like follows the commandments below), but remember:

    Thou Shalt Understand Thy Data Source.

    Thou Shalt Not Keep Thy Consultants In The Dark, Not Even Casual
Ones.
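
        Purely to give the "two or three analyses" idea some concrete shape,
here is one possible sketch in Python (not a prescription -- the right
decomposition depends entirely on how your data arise; the names x and y and
the use of statsmodels are my own assumptions):

    import numpy as np
    import statsmodels.api as sm

    # Step 1: separate the observations sitting exactly on the boundary
    # (the Bernoulli-like part) from those strictly between 0 and 1.
    at_boundary = (y == 0) | (y == 1)
    X = sm.add_constant(x)

    # Step 2a: does an observation end up at a boundary at all?
    boundary_fit = sm.Logit(at_boundary.astype(float), X).fit(disp=0)
    print(boundary_fit.summary())

    # Step 2b: among boundary cases, which boundary (0 vs. 1) was hit?
    hit_one = (y[at_boundary] == 1).astype(float)
    which_fit = sm.Logit(hit_one, X[at_boundary]).fit(disp=0)
    print(which_fit.summary())

    # Step 3: analyse the continuous part on its own; whether ordinary
    # correlation is defensible here depends on how those in-between
    # values are generated.
    interior = ~at_boundary
    print(np.corrcoef(x[interior], y[interior])[0, 1])

Whether steps 2a and 2b should be two separate models, one multinomial model,
or something else again depends on which of the mechanisms above (mixture,
clamping, or both) you believe is at work.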

        -Robert Dawson