Selections from an earlier exchange on EdStat-L are followed by some
comments.
----- Forwarded message from Donald Burrill -----
On Sat, 5 Aug 2000, Gates, Christopher [OMP] wrote:
> Donald, thank you so much for your response. I had the opportunity to
> converse with my friend (HOH) on this matter again, and his explanation
> seemed to closely follow yours, or at least that's how I see it.
>
> I guess the bottom line for me is that the assumption for normality in
> the t-test relates to the [population of] sample averages (if one could
> get such measures) which, regardless of the type of sample distribution,
> are probably (CLT) approximately normally distributed for n > 4? 30?
The approximation is asymptotically better for larger n, of course.
> It would seem to me then, that for almost any usual (>5) sample size
> for a t-test, there is probably little need to do any testing of the
> normality of the sample since by the CLT the averages are probably
> going to be normally distributed.
There is almost certainly little _utility_ to so doing; for small sample
sizes tests for normality (or for that matter any other distribution)
have little power.
> Is this an acceptable position to take?
"Acceptable" I don't know about: depends on the universe of discourse.
But you should try to justify the assumption that the observations are
taken independently, and that the underlying within-group variances are
approximately equal. And you should also be aware that while the t-test
is well known to be fairly robust against violations of assumptions,
that robustness applies to two-sided tests; one-sided tests are, by
comparison, rather fragile.
------------------------------------------------------------------------
Donald F. Burrill [EMAIL PROTECTED]
348 Hyde Hall, Plymouth State College, [EMAIL PROTECTED]
MSC #29, Plymouth, NH 03264 603-535-2597
184 Nashua Road, Bedford, NH 03110 603-471-7128
----- End of forwarded message from Donald Burrill -----
This is meant to supplement Don's comments rather than to contradict
them.
First, I put "[]" around an occurrence of the phrase "population of" that
is a bit confusing. It's best not to refer to anything other than THE
population as A population.
For very large samples you can use the normal distribution to do
confidence intervals and hypothesis tests. For most distributions
encountered in practice, the CLT will save you even if the population
distribution is fairly weird. For example, you can do confidence
intervals and hypothesis tests for proportions, for which the
population is essentially a 0-1 distribution, which is far from
normal. Here I am thinking of sample sizes common in, say, a Gallup
poll -- 1500 to 5000. For smaller samples, the closeness of this
approximation depends on how close the population is to a normal
distribution. If it is EXACTLY normal, then even the means of samples
of size 1 are normally distributed. Despite attempts to legislate "magic
numbers" for how big the sample has to be for this approximation to
work, the only honest answer is "it depends". In terms of the
discussion above, what we need is to have the sampling distribution of
the mean close to normal, but for small samples, that only happens if
the population distribution is close to normal.
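To make the "it depends" point concrete, here is a minimal simulation
sketch in Python (standard library only). The exponential population,
the sample sizes, the seed, and the helper names are my own choices for
illustration, not anything from the exchange above. It measures how
skewed the sampling distribution of the mean is; a value near 0 means
roughly symmetric.

```python
import random
import statistics

def skewness(values):
    """Mean cubed z-score; 0 for a perfectly symmetric distribution."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

def sample_means(draw, n, reps, rng):
    """Means of `reps` independent samples of size n drawn with draw(rng)."""
    return [sum(draw(rng) for _ in range(n)) / n for _ in range(reps)]

def draw_skewed(rng):
    # Assumed population: exponential with mean 1, strongly right-skewed.
    return rng.expovariate(1.0)

rng = random.Random(2000)  # fixed seed so the run is repeatable
skew_by_n = {
    n: skewness(sample_means(draw_skewed, n, 2000, rng))
    for n in (5, 50, 500)
}
for n, sk in skew_by_n.items():
    print(f"n = {n:3d}: skewness of the sample means = {sk:.2f}")
```

Swapping in a different draw function (for example one that returns 0
or 1) shows how the sample size needed changes with the population.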
In addition to the approximation involved in using the CLT, most
(possibly all) practical situations require that you estimate the
population standard deviation with the sample standard deviation in
calculating a standard error for use in constructing a confidence
interval or doing a hypothesis test. This introduces additional
error. Again, the error is small for large samples. For smaller
samples, it can be fairly large. The usual way around that problem is
to use the t distribution, which you can think of as a modified normal
distribution -- the modifications being those needed to exactly offset
this source of error. The trouble is, in order to calculate those
corrections, we need to know the shape of the population
distribution. The corrections incorporated into the t-distribution
are those appropriate for a normal distribution. So, when we use the
t-distribution, we need to have the population close to normally
distributed in order for the usual test statistic to have a
t(not z)-distribution.
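One way to see this is to check how often the nominal 95% t interval
actually covers the population mean when n = 5. The sketch below is
only an illustration under my own assumptions: a standard normal
population versus an exponential one, 20000 replications, and the
usual table value 2.776 for t with 4 degrees of freedom. If the
population shape did not matter, both coverage rates would sit near
0.95.

```python
import math
import random
import statistics

T_CRIT_DF4 = 2.776  # two-sided 95% critical value, t with 4 d.f. (n = 5)

def t_interval_covers(sample, true_mean):
    """True if xbar +/- t * s / sqrt(n) contains the population mean."""
    n = len(sample)
    xbar = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)
    return abs(xbar - true_mean) <= T_CRIT_DF4 * std_err

def coverage(draw, true_mean, reps, rng):
    """Fraction of size-5 samples whose t interval covers true_mean."""
    hits = 0
    for _ in range(reps):
        sample = [draw(rng) for _ in range(5)]
        hits += t_interval_covers(sample, true_mean)
    return hits / reps

rng = random.Random(5)  # fixed seed so the run is repeatable
# Assumed populations: standard normal (mean 0) and exponential (mean 1).
cov_normal = coverage(lambda r: r.gauss(0.0, 1.0), 0.0, 20000, rng)
cov_skewed = coverage(lambda r: r.expovariate(1.0), 1.0, 20000, rng)
print(f"normal population:      coverage = {cov_normal:.3f}")
print(f"exponential population: coverage = {cov_skewed:.3f}")
```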
In no case is there any "assumption" about the distribution of the
data in the sample. It is what it is. If you simulate some samples
of size five from a normal distribution, you will find a lot of
variety, with very few samples looking remotely "bell-shaped".
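If you want to try that yourself without Minitab, a rough sketch in
Python follows. The one-line character plot is my own crude stand-in
for a dotplot (a digit gives the number of points in that column), and
the seed, the plotting range, and the number of samples are arbitrary
choices.

```python
import random

def text_dotplot(sample, lo=-3.0, hi=3.0, width=61):
    """One-line plot: each digit counts the points landing in that column."""
    counts = [0] * width
    for x in sample:
        clipped = min(max(x, lo), hi)  # pile extreme values at the ends
        col = round((clipped - lo) / (hi - lo) * (width - 1))
        counts[col] += 1
    return "".join(str(c) if c else "." for c in counts)

rng = random.Random(1991)  # fixed seed so the run is repeatable
# Six samples of size five from a standard normal population.
samples = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(6)]
lines = [text_dotplot(s) for s in samples]
for line in lines:
    print(line)
```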
Even so, it is always a good idea to LOOK AT YOUR DATA. Even with a
sample size as small as five, you may be able to see signs that the
population is NOT normally distributed. For example, here is a
dataset on prizes awarded in golf tournaments during 1991, analyzed in
Minitab.
MTB > Retrieve 'E:\STATS\MINITAB8\STATS1A\MISC\GOLF91.MTW'.
Retrieving worksheet from file: E:\STATS\MINITAB8\STATS1A\MISC\GOLF91.MTW
Worksheet was saved on 1/18/1996
MTB > dotplot c1
[Character dotplot of prize; horizontal axis labeled 50 to 300]
The data are clearly bimodal. This usually means that there are two
groups present in the data and we need to separate them out before
doing our analysis. Here the two groups are men's and women's
tournaments.
MTB > dotplot c1;
SUBC> by c2.
[Character dotplots of prize by sex on a common axis labeled 50 to 300:
sex 1 = Men, sex 2 = Women]
Now, just for the sake of an example, let's treat the prize monies
above as the population. It is too abnormal for ordinary inference
procedures to work well if we take samples of size five. But if we
did not already know that, would we be able to detect the bimodality
in the samples? Let's take some.
MTB > sample 5 from c1 in c5
MTB > sample 5 from c1 in c6
MTB > sample 5 from c1 in c7
MTB > sample 5 from c1 in c8
MTB > sample 5 from c1 in c9
MTB > sample 5 from c1 in c10
MTB > dotplot c5-c10;
SUBC> same.
[Character dotplots of samples C5 through C10 on a common axis labeled
40 to 240]
For onscreen display purposes, I edited the dotplots so an "8"
represents two data points at (about) the same value while an "o"
represents a single observation. Do these samples look bimodal? My
answer would be -- often enough that it is WORTH LOOKING AT YOUR DATA.
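For a rough sense of "often enough", here is a sketch that repeats the
sampling many times in Python. Please note the population below is a
HYPOTHETICAL stand-in that I made up (two separated clusters of 25 and
45 values); it is NOT the GOLF91 worksheet, and the cluster locations
and the cutoff are assumptions. It counts how often a sample of five,
drawn without replacement, contains values from both clusters. Such a
sample at least has a chance of showing a gap, though that is no
guarantee it will look bimodal to the eye.

```python
import random

rng = random.Random(91)  # fixed seed so the run is repeatable

# HYPOTHETICAL bimodal population, invented for illustration only.
low_cluster = [rng.uniform(40, 110) for _ in range(25)]
high_cluster = [rng.uniform(180, 260) for _ in range(45)]
population = low_cluster + high_cluster

def spans_both_clusters(sample, cutoff=145.0):
    """True if the sample has values on both sides of the assumed gap."""
    return min(sample) < cutoff < max(sample)

reps = 10000
both = 0
for _ in range(reps):
    # random.sample draws without replacement
    sample = rng.sample(population, 5)
    both += spans_both_clusters(sample)
frac_both = both / reps
print(f"samples of 5 containing both clusters: {frac_both:.3f}")
```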
Robert W. Hayden
Work: Department of Mathematics
      Plymouth State College MSC#29
      Plymouth, New Hampshire 03264 USA
      fax (603) 535-2943
Home: 82 River Street (use this in the summer)
      Ashland, NH 03217
      (603) 968-9914 (use this year-round)
[EMAIL PROTECTED] (works year-round)
http://mathpc04.plymouth.edu (works year-round)