Here's a response to the two people who have replied to the list
about my query. (Thanks heaps for your input. This list is
wonderful. If it ever loses its institutional support, and no one
else wants to pick it up, I will. I'd run it with listproc, and we
would have moderators to filter out the spam.)
At 6:32 PM -0500 15/1/01, Bob Wheeler wrote:
>In practice the observed residuals are highly
>correlated and, if the design is a good one,
>fluctuate in a small space with few degrees of
>freedom. Applying any test for non-normality to
>such observed residuals is fairly futile.
The test for normality is a test of distribution of magnitudes, not
independence of the errors or of the residuals. Yes, the residuals
are correlated, but that may have no bearing on the normality of
their distribution. If you fit a straight line to three points drawn
from a population with a correlation between two normally distributed
variables, are the residuals normally distributed? I guess they must
be, or the regression analysis wouldn't give correct confidence
limits. (BTW, "highly" correlated is surely not correct for analyses
with a few estimated parameters and many degrees of freedom. And I'm
not sure why a good design would be one in which the residuals had
few degrees of freedom. The bigger the sample size, the better,
except when you end up with more precision for your effects than you
need.)
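A quick simulation sketch of the three-point example (in Python with numpy/scipy; the design and numbers are my own illustration, not from the thread): fit a line to three points with normal errors many times. The residuals are indeed highly correlated, because with three points they carry only one degree of freedom, yet each residual is exactly normally distributed.

```python
# Sketch: straight-line fit to three points with normal errors.
# The residuals are perfectly correlated (one residual degree of
# freedom), but their distribution is still normal.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = np.array([0.0, 1.0, 2.0])                # fixed three-point design
res = np.empty((4000, 3))
for i in range(4000):
    y = 1.0 + 2.0 * x + rng.normal(size=3)   # normal errors by construction
    slope, intercept = np.polyfit(x, y, 1)
    res[i] = y - (intercept + slope * x)

# Pairwise correlation of residuals: with one df it is exactly -1 here.
print("corr(r0, r1):", np.corrcoef(res[:, 0], res[:, 1])[0, 1])

# Standardize each residual by its own spread, then pool: the pooled
# distribution should be close to standard normal (skew and excess
# kurtosis near zero).
pooled = (res / res.std(axis=0)).ravel()
print("pooled skew, excess kurtosis:",
      stats.skew(pooled), stats.kurtosis(pooled))
```

So correlation among the residuals says nothing by itself about the shape of their distribution.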
It is probably futile to TEST for non-normality whatever the sample
size, because, as Robert Dawson pointed out, large samples usually
test positive for non-normality even when the residuals look
reasonably normal, whereas small samples usually test negative even
when the residuals are quite non-normal. But it is not futile to
ESTIMATE non-normality for large sample sizes, if you know the
magnitude of non-normality that starts to screw up your estimates and
their confidence limits. And it may not be futile to estimate
non-normality for small sample sizes either, depending on how you
think the residuals should be distributed. For example, you may have
good reasons for doing a log transformation, so check the residuals
after log transformation. If they have a higher normality score than
the residuals from the raw variable, fine, even though neither set of
residuals differs statistically significantly from normal.
What will matter is HOW non-normal the residuals are. My question
about the magnitude of deviation from normality still stands.
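The TEST-versus-ESTIMATE distinction can be sketched in Python (numpy/scipy; the distributions and sample sizes are my own illustration): a large, nearly normal sample may well "fail" a normality test, while a small, clearly skewed one may "pass", but the estimated skewness and excess kurtosis describe HOW non-normal each sample is regardless of n.

```python
# Sketch: a normality TEST is dominated by sample size, whereas
# ESTIMATES of skewness and excess kurtosis measure the magnitude
# of non-normality directly.  Illustrative choices throughout.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
near_normal = rng.standard_t(df=10, size=5000)   # large, mildly heavy-tailed
very_skewed = rng.lognormal(sigma=1.0, size=30)  # small, clearly skewed

# The test answers "can we detect ANY departure?", which mostly
# reflects n rather than the size of the departure.
print("large-sample Shapiro p:", stats.shapiro(near_normal).pvalue)
print("small-sample Shapiro p:", stats.shapiro(very_skewed).pvalue)

# The estimates answer "HOW non-normal is it?".
print("large sample skew, kurtosis:",
      stats.skew(near_normal), stats.kurtosis(near_normal))
print("small sample skew, kurtosis:",
      stats.skew(very_skewed), stats.kurtosis(very_skewed))
```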
At 10:41 AM -0400 16/1/01, Robert J. MacG. Dawson wrote:
> There are those who would omit the word "small" from this; myself, I am
>prepared to use a large data set as evidence of its own approximate
>normality, largely because when the data set is large, "approximate
>normality" may be very approximate indeed, as the Central Limit Theorem
>will take care of almost anything. For large N, the t test is
>essentially nonparametric.
Are you suggesting that you don't need transformation for large
sample sizes? I think you have a popular misconception about the
central limit theorem. Sure, the mean of a large sample is normally
distributed, whatever the parent distribution, but that's not the
issue. It's the residuals that have to be normal. If your variable
is non-normally distributed, it doesn't matter how big your sample
size is: the precision of your estimates based on the raw variable
will never be correct. Or to put it another way, you will find a
substantial difference between the estimates from the untransformed
vs the transformed data. Which estimates do you use? The ones from
the analysis that has residuals closer to normality.
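A small simulation sketch of that last point (Python with numpy/scipy; the multiplicative-error model is my own illustration): when the errors are multiplicative, the residuals from the raw-scale analysis stay skewed no matter how large n is, while the residuals after a log transformation are close to normal.

```python
# Sketch: raw-scale vs log-scale residuals for data with
# multiplicative (lognormal) error.  Larger n does not rescue the
# raw-scale analysis; the transformation does.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 5000
x = rng.uniform(0.0, 1.0, size=n)
y = np.exp(1.0 + 2.0 * x) * rng.lognormal(sigma=0.8, size=n)

def fit_residuals(xv, yv):
    slope, intercept = np.polyfit(xv, yv, 1)   # straight-line fit
    return yv - (intercept + slope * xv)

raw_skew = stats.skew(fit_residuals(x, y))          # stays skewed
log_skew = stats.skew(fit_residuals(x, np.log(y)))  # near zero
print("raw-scale residual skew:", raw_skew)
print("log-scale residual skew:", log_skew)
```

The analysis whose residuals are closer to normal (here, the log-scale one) is the one whose confidence limits you should trust.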
> I would suggest using boxplots to spot very skewed or heavy-tailed
>samples;
You see, you are using a qualitative estimate of non-normality! I
want a rule based on a quantitative estimate.
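For what a quantitative rule might look like, here is one possible sketch (Python, numpy/scipy): compare the sample skewness and excess kurtosis of the residuals with their large-sample standard errors. The cutoff of 2 standard errors is purely an illustrative convention of mine, not a recommendation from anyone in this thread.

```python
# Sketch of a quantitative rule in place of eyeballing a boxplot:
# flag a sample when its skewness or excess kurtosis exceeds a
# multiple of the large-sample standard error.  The cutoff is an
# illustrative assumption, not an established rule.
import numpy as np
from scipy import stats

def nonnormality_flags(sample, cutoff=2.0):
    n = len(sample)
    se_skew = np.sqrt(6.0 / n)       # large-sample SE of skewness
    se_kurt = np.sqrt(24.0 / n)      # large-sample SE of excess kurtosis
    return {
        "skew_z": stats.skew(sample) / se_skew,
        "kurt_z": stats.kurtosis(sample) / se_kurt,
        "flag": abs(stats.skew(sample)) > cutoff * se_skew
                or abs(stats.kurtosis(sample)) > cutoff * se_kurt,
    }

rng = np.random.default_rng(3)
print(nonnormality_flags(rng.normal(size=200)))       # usually not flagged
print(nonnormality_flags(rng.exponential(size=200)))  # clearly flagged
```

Whatever the right cutoff turns out to be, that is the kind of rule I am after: a number, not a picture.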
Will
=================================================================
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
http://jse.stat.ncsu.edu/
=================================================================