On Sun, 2 Jan 2000, bkamen wrote:
> This practical question arose between myself and a colleague at work.
> It concerns whether we can use correlation analysis if one of the
> variables is non-continuous or "categorical." She believes that both
> variables must be continuous. However she cannot say why, and I cannot
> find any such constraint in the statistics book I have relied on since
> graduating in Industrial Engineering a few years ago, Miller and Freund,
> 'Probability and Statistics for Engineers.'
Depends on the categorical variable. If it really is categories only,
of nominal scale, a correlation coefficient is meaningless. If the
categories are at least ordered, a correlation has some meaning. Some
folks will insist that the variables be not only ordered but of interval
scale as well; if one is feeling punctilious, the correlation may be
interpreted with a prefatory "If one assumes equal intervals in the
variable", or words to that general effect.
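
To make the nominal / ordinal / interval distinction concrete, here is a
minimal sketch in Python. The data, the category labels, and the three
numeric codings are all invented for illustration, and pearson_r is a
helper written for this example, not a library routine. It shows that the
coefficient depends on the numbers one chooses to attach to the
categories: unequal spacing shifts it, and an arbitrary (nominal-style)
assignment can even reverse its sign.

```python
import math

def pearson_r(xs, ys):
    """Product-moment correlation computed directly from its definition."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: one category label and one continuous response per case.
category = ["low", "low", "mid", "mid", "high", "high"]
y = [1.0, 2.0, 4.0, 5.0, 6.0, 8.0]

# Three ways of turning the labels into numbers (all assumptions).
equal_steps = {"low": 1, "mid": 2, "high": 3}     # ordered, equal intervals
unequal_steps = {"low": 1, "mid": 2, "high": 10}  # ordered, unequal intervals
scrambled = {"low": 2, "mid": 3, "high": 1}       # order ignored (nominal)

r_equal = pearson_r([equal_steps[c] for c in category], y)
r_unequal = pearson_r([unequal_steps[c] for c in category], y)
r_scrambled = pearson_r([scrambled[c] for c in category], y)

print(f"equal steps:   r = {r_equal:.3f}")
print(f"unequal steps: r = {r_unequal:.3f}")
print(f"scrambled:     r = {r_scrambled:.3f}")
```

With ordered codes the sign and rough size of r survive a change of
spacing; with a scrambled coding the value is an artifact of the labels.
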
> I have been thinking that if x is discrete and can assume only a few
> values compared with y which is continuous, the correlation study may
> yield a high probability of type-one error. I interpret this as
> providing insufficient evidence with which to reject the null
> hypothesis.
Non sequitur. Depends on the relationship between Y and X. As is
probably obvious, no one should attempt to interpret a correlation
coefficient without having first inspected the relevant scatterplot, and
being able to state unequivocally that the correlation coefficient fairly
reflects the degree of association between the variables.
> But I have not thought of this as an inappropriate use of correlation.
Again, depends on whether the categories be at least ordered.
> On the other hand in attempting to probe Miller and Freund I find that
> correlation is based on the "bivariate normal distribution," the
> formula for which has numerous parameters including alpha and beta, the
> least squares regression coefficients. I am aware that to obtain the
> latter requires that the function be differentiable, hence x must also
> be continuous. This seems to support my friend's view.
Somewhat mis-stated. Correlation -- that is, the computation of a
product-moment correlation coefficient -- entails no distributional
requirements; for the coefficient to be interpretable, though, the
bivariate relationship should be at least approximately linear. (If it
is not, the correlation coefficient will understate the 'real' degree of
the relationship.) However, to test a hypothesis one must make some
distributional assumptions, and the standard assumption (for the
standard test) is that the two variables jointly follow a bivariate
normal distribution.
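
The remark about non-linear relationships can be checked with a small
sketch. The data are invented: y is an exact function of x (y = x**2), so
the 'real' relationship is perfect, yet the product-moment coefficient is
zero when x is symmetric about its mean and falls short of 1 even when
the curve is monotone. pearson_r is a helper written for this example.

```python
import math

def pearson_r(xs, ys):
    """Product-moment correlation computed directly from its definition."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# x symmetric about zero: a perfect but U-shaped dependence.
x_sym = [-3, -2, -1, 0, 1, 2, 3]
r_sym = pearson_r(x_sym, [x ** 2 for x in x_sym])

# x restricted to non-negative values: same function, now monotone.
x_half = [0, 1, 2, 3, 4, 5, 6]
r_half = pearson_r(x_half, [x ** 2 for x in x_half])

print(f"symmetric x: r = {r_sym:.3f}")
print(f"x >= 0:      r = {r_half:.3f}")
```

In both cases y is completely determined by x; only the scatterplot, not
the coefficient, reveals that.
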
Least-squares coefficients do not require that the function be
"differentiable, hence continuous": the procedure only requires that it
be possible to find a minimum of the sum of squared residuals, and that
minimum is sought with respect to the coefficients, not with respect to
x. Since no real data is in fact continuous, strictly speaking, it's a
good thing that one need not require the variables to be, strictly,
continuous. (One may wish to assume, in the absence of evidence to the
contrary, that the latent variable, of which the observed variable is
presumed to be an approximation, is continuous; an interesting task it
is to imagine what evidence would be relevant. But one cannot compute a
correlation with a latent variable.)
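
As a rough illustration of the least-squares point, the sketch below fits
a line to invented data in which x takes only the values 0 and 1. The
function names fit_line and sse are made up for this example. The
coefficients come from the usual closed-form sums, and nothing in the
computation asks x to be continuous.

```python
def fit_line(xs, ys):
    """Closed-form least-squares intercept and slope for y = a + b*x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    a = my - b * mx
    return a, b

def sse(a, b, xs, ys):
    """Sum of squared residuals for the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data: x is a two-valued (0/1) variable, y is continuous.
x = [0, 0, 0, 1, 1, 1]
y = [2.0, 3.0, 4.0, 5.0, 7.0, 9.0]

a, b = fit_line(x, y)
best = sse(a, b, x, y)

# Nudging either coefficient away from the fitted value raises the
# sum of squared residuals.
nudged = [sse(a + da, b + db, x, y)
          for da, db in [(0.1, 0), (-0.1, 0), (0, 0.1), (0, -0.1)]]

print(f"intercept = {a}, slope = {b}, SSE = {best}")
print("nudged SSEs:", [round(s, 3) for s in nudged])
```

With a 0/1 predictor the fitted intercept is the mean of y in the x = 0
group and the slope is the difference between the two group means.
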
> I would appreciate clarification of any such constraints on the
> practical use of correlation analysis. Also, if anyone can recommend a
> textbook that addresses questions such as this more directly than Miller
> and Freund, I would appreciate that also.
------------------------------------------------------------------------
Donald F. Burrill [EMAIL PROTECTED]
348 Hyde Hall, Plymouth State College, [EMAIL PROTECTED]
MSC #29, Plymouth, NH 03264 603-535-2597
184 Nashua Road, Bedford, NH 03110 603-471-7128