> 
> At 10:41 AM -0400 16/1/01, Robert J. MacG. Dawson wrote:
> >       There are those who would omit the word "small" from this; myself, I am
> >prepared to use a large data set as evidence of its own approximate
> >normality, largely because when the data set is large, "approximate
> >normality" may be very approximate indeed, as the Central Limit Theorem
> >will take care of almost anything. For large N, the t test is
> >essentially nonparametric.

and Will Hopkins replied:

> Are you suggesting that you don't need transformation for large
> sample sizes?  I think you have a popular misconception about the
> central limit theorem.  Sure, the mean of a large sample is normally
> distributed, whatever the parent distribution, but that's not the
> issue.  It's the residuals that have to be normal.

        No, the residuals do *not* have to be normal; the sampling
distribution of (Xbar - mu)/(s/sqrt(n)) is close to Student's t
distribution with n-1 DOF for most distributions, provided n is large
enough.  (There are, of course, exceptions.)
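
        A rough simulation sketch of that claim, in Python with numpy and
scipy assumed to be available (the exponential parent and n = 200 are
purely illustrative choices, not anything from this thread):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    mu, n, reps = 1.0, 200, 20000

    # strongly right-skewed parent distribution with known mean mu
    x = rng.exponential(scale=mu, size=(reps, n))
    xbar = x.mean(axis=1)
    s = x.std(axis=1, ddof=1)
    t = (xbar - mu) / (s / np.sqrt(n))

    # achieved size of a nominal 5% two-sided t test
    crit = stats.t.ppf(0.975, df=n - 1)
    print(np.mean(np.abs(t) > crit))

For n this large the printed rejection rate should land close to 0.05,
even though the residuals are badly skewed; rerun it with a small n
(say 10) and the agreement is noticeably worse.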

        The main point of transformation, in such cases, is to provide
inferences about a more appropriate location parameter, not to provide
more appropriate inferences about the mean.


> If your variable
> is non-normally distributed, it doesn't matter how big your sample
> size is: the precision of your estimates based on the raw variable
> will never be correct.  Or to put it another way, you will find a
> substantial difference between the estimates from the untransformed
> vs the transformed data.  Which estimates do you use?  The ones from
> the analysis that has residuals closer to normality.

        I would agree with this, because that analysis is usually about the
more appropriate parameter. However, if you have a special interest in
the mean (say you want to estimate the *total* weight of 100,000
potatoes based on a sample of 50, so that your problem is inherently
additive), you do not want to transform and study (say) the geometric
mean, even if the potato weights are more symmetric after
transformation. Nor do you want to use rank-based methods and study the
median. Neither of those parameters will give you a good estimate of
the weight of 100,000 spuds when you multiply by 2000.
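
        A quick numerical illustration of that point, in Python with numpy;
the lognormal weights and the seed are hypothetical, chosen only to
give a right-skewed population:

    import numpy as np

    rng = np.random.default_rng(1)
    N_pop, n_sample = 100_000, 50

    # hypothetical skewed population of potato weights, in grams
    weights = rng.lognormal(mean=5.0, sigma=0.5, size=N_pop)
    sample = rng.choice(weights, size=n_sample, replace=False)

    print("true total         ", weights.sum())
    print("sample mean * N    ", sample.mean() * N_pop)
    print("geometric mean * N ", np.exp(np.log(sample).mean()) * N_pop)
    print("sample median * N  ", np.median(sample) * N_pop)

The mean-based estimate is the only one aiming at the right target; the
geometric-mean and median estimates sit systematically below the true
total, because for right-skewed data both fall below the arithmetic
mean.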


> 
> >       I would suggest using boxplots to spot very skewed or heavy-tailed
> >samples;
> 
> You see, you are using a qualitative estimate of non-normality!  I
> want a rule based on a quantitative estimate.

        Don't know of one. I guess you'd need a measure of non-normality that
puts stress on the third and fourth moments, as those are the ones that
vanish most slowly in the sample mean as N increases. Maybe
|skewness|/sqrt(N) plus (excess) kurtosis/N, or something like that,
would work, but I can't say it looks like a very productive question to
me.
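
        For what it's worth, a measure of that general flavour is easy to
compute; here is a minimal sketch in Python with scipy assumed, using
sample skewness and sample excess kurtosis (the function name and the
exact combination are made up for illustration, not an established
statistic):

    import numpy as np
    from scipy import stats

    def clt_roughness(x):
        # crude index: terms that decay like 1/sqrt(N) and 1/N in the
        # skewness and excess kurtosis of the sample mean
        n = len(x)
        return abs(stats.skew(x)) / np.sqrt(n) + abs(stats.kurtosis(x)) / n

    rng = np.random.default_rng(2)
    print(clt_roughness(rng.normal(size=500)))       # near zero
    print(clt_roughness(rng.exponential(size=500)))  # larger, but still small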
        
        Besides, there is the usual problem that the sampling distribution of
such a measure is more or less by definition very dispersed - the old
"sending a boy out in a rowboat to see if it's safe for a liner to leave
port" problem.

        
        -Robert Dawson

