On Sun, 2 Jan 2000, bkamen wrote:

> This practical question arose between myself and a colleague at work.  
> It concerns whether we can use correlation analysis if one of the 
> variables is non-continuous or "categorical."  She believes that both 
> variables must be continuous.  However she cannot say why, and I cannot 
> find any such constraint in the statistics book I have relied on since 
> graduating in Industrial Engineering a few years ago, Miller and Freund, 
> 'Probability and Statistics for Engineers.'

Depends on the categorical variable.  If it really is categories only, 
of nominal scale, a correlation coefficient is meaningless.  If the 
categories are at least ordered, a correlation has some meaning.  Some 
folks will insist that the variables be not only ordered but of interval 
scale as well;  if one is feeling punctilious, the correlation may be 
interpreted with a prefatory "If one assumes equal intervals in the 
variable", or words to that general effect.
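
To make the nominal-versus-ordered distinction concrete, here is a small 
sketch in Python.  The data, the category labels and both numeric codings 
are invented for illustration, and pearson_r is just a hand-rolled helper, 
not anything from Miller and Freund.  If the categories are merely 
nominal, either coding is as defensible as the other, yet the two give 
quite different coefficients.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: a three-level category and a continuous response.
groups = ["A", "A", "B", "B", "C", "C"]
y = [1.0, 1.2, 2.1, 1.9, 3.0, 3.2]

# Two arbitrary numeric codings of the same categories.
coding_1 = {"A": 1, "B": 2, "C": 3}
coding_2 = {"A": 2, "B": 1, "C": 3}

r1 = pearson_r([coding_1[g] for g in groups], y)
r2 = pearson_r([coding_2[g] for g in groups], y)
print(f"coding 1: r = {r1:.3f}")
print(f"coding 2: r = {r2:.3f}")
```

Only when the order A < B < C means something does the first coding have 
a claim to being the "right" one, and even then the equal-interval caveat 
above still applies.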
 
> I have been thinking that if x is discrete and can assume only a few 
> values compared with y which is continuous, the correlation study may 
> yield a high probability of type-one error.  I interpret this as 
> providing insufficient evidence with which to reject the null 
> hypothesis. 

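Non sequitur.  (A type-one error is the rejection of a true null 
hypothesis, which is not the same thing as having insufficient evidence 
to reject it.)  Depends on the relationship between Y and X.  As is 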
probably obvious, no-one should attempt to interpret a correlation 
coefficient without having first inspected the relevant scatterplot, and 
being able to state unequivocally that the correlation coefficient fairly 
reflects the degree of association between the variables.

> But I have not thought of this as an inappropriate use of correlation. 

Again, depends on whether the categories be at least ordered.

> On the other hand in attempting to probe Miller and Freund I find that 
> correlation is based on the "bivariate normal distribution,"  the 
> formula for which has numerous parameters including alpha and beta, the 
> least squares regression coefficients.  I am aware that to obtain the 
> latter requires that the function be differentiable, hence x must also 
> be continuous.  This seems to support my friend's view.

Somewhat mis-stated.  Correlation -- that is, the computation of a 
product-moment correlation coefficient -- does not entail any 
distributional requirements.  For the coefficient to be a fair summary, 
though, the bivariate relationship should be at least approximately 
linear.  (If it is not, the correlation coefficient will understate the 
'real' degree of the relationship.)  However, to test an hypothesis one 
must make some distributional assumptions, and the standard assumption 
(for the standard test) is that the joint distribution of the two 
variables is bivariate normal.  
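
A small Python sketch of both points, with invented data.  The first part 
shows a perfectly deterministic but curved relationship (y = x squared 
over a symmetric range) for which the coefficient comes out as zero.  The 
second part assumes that the "standard test" is the usual t test of zero 
correlation, t = r * sqrt(n - 2) / sqrt(1 - r^2) on n - 2 degrees of 
freedom.  The helper names and the values r = 0.5, n = 27 are purely 
illustrative.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def t_statistic(r, n):
    """t statistic for H0: rho = 0; assumes bivariate normality, |r| < 1."""
    return r * sqrt(n - 2) / sqrt(1 - r * r)

# Y is completely determined by X, but the relationship is not linear.
x = [-3, -2, -1, 0, 1, 2, 3]
y_curved = [v * v for v in x]
r_curved = pearson_r(x, y_curved)
print(f"r for y = x^2: {r_curved:.3f}")

# Illustrative values only: a sample correlation of 0.5 from 27 pairs.
t_example = t_statistic(0.5, 27)
print(f"t for r = 0.5, n = 27: {t_example:.3f} on 25 df")
```

Computing r required no distributional assumption at all.  Only the 
step of referring t to a t distribution leans on bivariate normality.
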
        Least-squares coefficients do not require that the function be 
"differentiable, hence continuous":  the procedure only requires that it 
be possible to find a minimum of an expression for the sum of squared 
residuals.  Since no real data is in fact continuous, strictly speaking, 
it's a good thing that one need not require the variables to be, 
strictly, continuous.  (One may wish to assume, in the absence of 
evidence to the contrary (and an interesting task it is to imagine what 
evidence would be relevant), that the latent variable (of which the 
observed variable is presumed to be an approximation) is continuous. 
But one cannot compute a correlation with a latent variable.)
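
As a sketch of why discreteness in X is harmless here: the sum of squared 
residuals is minimized with respect to the coefficients, not with respect 
to X, so X may take as few distinct values as it likes (provided it takes 
at least two).  The Python below uses invented data with a two-valued X.  
The function name least_squares_line is hypothetical, and the closed-form 
expressions are the ordinary textbook ones for a straight-line fit.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) minimizing sum((y - a - b*x)**2)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope

# Hypothetical data: X takes only the values 0 and 1.
x = [0, 0, 0, 1, 1, 1]
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

intercept, slope = least_squares_line(x, y)
print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")

# With a 0/1 predictor the fitted line passes through the two group means.
mean_0 = sum(v for u, v in zip(x, y) if u == 0) / x.count(0)
mean_1 = sum(v for u, v in zip(x, y) if u == 1) / x.count(1)
print(f"group means: {mean_0:.3f} and {mean_1:.3f}")
```

Here the intercept is the mean of Y where X = 0 and the slope is the 
difference between the two group means, which is a perfectly sensible 
result from a predictor that is about as non-continuous as one can get.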

> I would appreciate clarification of any such constraints on the 
> practical use of correlation analysis.  Also, if anyone can recommend a 
> textbook that addresses questions such as this more directly than Miller 
> and Freund, I would appreciate that also.

 ------------------------------------------------------------------------
 Donald F. Burrill                                 [EMAIL PROTECTED]
 348 Hyde Hall, Plymouth State College,          [EMAIL PROTECTED]
 MSC #29, Plymouth, NH 03264                                 603-535-2597
 184 Nashua Road, Bedford, NH 03110                          603-471-7128  
