Karl asks:

    When interested in the relationship between two continuous
variables, some researchers will dichotomize one of them prior to analysis.
I generally discourage such dichotomization, but the practice is common.  A
colleague asked me today about the practice of dichotomizing by a median
split (top half versus bottom half) versus the practice of using only the
tails (bottom third versus top third, for example).  I outlined my thoughts
on this matter and noted that I vaguely recall having read an article or two
on this matter long ago, but cannot put my finger on the article(s).  Can
any of you all?



Consider the model Y = a + b X + error.

A key component of the standard error of the estimate of b, and hence of
its confidence interval, is N * V(X).  Tradeoffs between N and the
variance of X are exact: halving N while doubling V(X) leaves that term
unchanged.  We can use this to examine the effect of (a) splitting X at
its median or (b) using only the upper and lower thirds of the
distribution.

Note that no matter what the distribution of X, the usual regression
provides an unbiased least-squares estimate of the coefficient b.  In
particular, if we split the observations on the basis of X in order to
compare the means of Y, and we also compute the mean of X within each
subgroup and use those means as the predictor values in a regression, we
will still get an unbiased estimate of b, but with a different standard
error.  Comparing the standard error for the continuous X with that for
the split X allows an examination of the effects of splitting.
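
As a rough illustration of the N * V(X) tradeoff, here is a minimal
Python sketch.  It assumes the residual standard deviation (called sigma
here, a name not used in the post) is known, and it takes V(X) with
divisor N so that N * V(X) is the sum of squared deviations of X.  The
function name is hypothetical.

```python
import math

def slope_standard_error(sigma, n, var_x):
    """Standard error of the least-squares slope in Y = a + b X + error.

    sigma is the residual standard deviation (assumed known here) and
    var_x is the variance of X computed with divisor n.
    """
    return sigma / math.sqrt(n * var_x)

# Halving N while doubling V(X) leaves the standard error unchanged.
se_full = slope_standard_error(sigma=1.0, n=200, var_x=1.0)
se_traded = slope_standard_error(sigma=1.0, n=100, var_x=2.0)
print(se_full, se_traded)
```

In practice sigma is estimated from the residuals, so the degrees of
freedom also matter; the sketch leaves that out.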

(a) Median split.  Let's assume a standard normal distribution for
illustration.  If we split at the median, we are also splitting at the
mean.  The mean value of X in the lower half of the distribution is
-Sqrt[2/Pi] = -.80 and the mean for the top half of the distribution is
Sqrt[2/Pi] = .80.  The variance of the split predictor is therefore
2/Pi = .636.  All the components of estimating the standard error of b
will be the same for both the continuous and the split model except that
   N V(X) = N (for the standard normal)
in the continuous model will be replaced by
   N .636 V(X) = .636 N (for the standard normal)
This is the same proportion by which the r^2 will be reduced, and this
value has appeared in numerous articles criticizing the splitting of
data.
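
The .636 figure can be reproduced with Python's standard library.  This
is only a sketch for the standard normal case used above; the variable
names are mine.

```python
import math
from statistics import NormalDist

std_normal = NormalDist(0.0, 1.0)

# Mean of a standard normal restricted to X > 0: density at 0 divided by
# the probability of landing in the upper half.
upper_half_mean = std_normal.pdf(0.0) / 0.5

# Replacing each X by +/- upper_half_mean gives a predictor whose
# variance is upper_half_mean squared.
median_split_factor = upper_half_mean ** 2

print(upper_half_mean, math.sqrt(2.0 / math.pi))
print(median_split_factor, 2.0 / math.pi)
```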

(b) Upper 1/3 and lower 1/3 of cases.  For a standard normal
distribution, the mean of the lower 1/3 of the values of X is -1.09 and
the mean for the upper 1/3 is +1.09.  So the variance is 1.09^2 = 1.19.
But we've also lost 1/3 of our cases.  Hence, the term
     N V(X) = N
is replaced in the thirds model by
    (2/3) N 1.19 V(X) = .79 N
Thus, in terms of the standard error and the confidence interval width,
using only the top 1/3 and bottom 1/3 of the data is not as destructive
as a median split, but it is still a bad idea.  Furthermore, for modest
sizes of N, the loss of 1/3 of the degrees of freedom might
substantially increase the value of the critical t.  In other words, the
thirds model will have substantially less statistical power than the
continuous model.
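
The .79 figure can be checked the same way.  The sketch below assumes a
standard normal X and group means used as predictor values, as in the
argument above; tails_factor is a hypothetical helper, not something
from the cited articles.  It does not model the change in the critical
t.

```python
from statistics import NormalDist

std_normal = NormalDist(0.0, 1.0)

def tails_factor(tail_fraction):
    """Multiplier on N * V(X) when only the top and bottom tail_fraction
    of a standard normal X are kept and X is replaced by its tail mean."""
    cutoff = std_normal.inv_cdf(1.0 - tail_fraction)
    # Mean of X above the cutoff: density at the cutoff / tail probability.
    tail_mean = std_normal.pdf(cutoff) / tail_fraction
    # Fraction of cases kept times the variance of the two-point predictor.
    return 2.0 * tail_fraction * tail_mean ** 2

thirds_factor = tails_factor(1.0 / 3.0)
halves_factor = tails_factor(0.5)   # the median split as a special case
print(thirds_factor, halves_factor)
```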

The message, repeated in numerous methodological articles and well known
by Pearson in 1900, is that (a) throwing away information about your
variable is never a good idea and (b) throwing away observations in the
middle of the distribution is never a good idea.
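
For anyone who prefers to see this empirically, here is a small
simulation sketch.  Everything in it is an assumption for illustration:
a standard normal X, normal errors, a true slope of 0.5, N = 300 and
2000 replications.  It compares the sampling variance of the slope from
the continuous regression with the slope obtained from two groups
(median split, or outer thirds) when the group means of X are used as
predictor values.

```python
import random
import statistics

def ols_slope(xs, ys):
    """Least-squares slope of ys on xs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def two_group_slope(low, high):
    """Slope from regressing Y on the group mean of X with two groups.

    With only two predictor values the fitted line passes through both
    group means, so the slope is a ratio of mean differences.
    """
    x_low = statistics.fmean([x for x, _ in low])
    x_high = statistics.fmean([x for x, _ in high])
    y_low = statistics.fmean([y for _, y in low])
    y_high = statistics.fmean([y for _, y in high])
    return (y_high - y_low) / (x_high - x_low)

def simulate(n=300, reps=2000, b=0.5, sigma=1.0, seed=1):
    """Return (median split ratio, thirds ratio), each the variance of
    the continuous slope divided by the variance of the split slope."""
    rng = random.Random(seed)
    cont, median, thirds = [], [], []
    for _ in range(reps):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        data = sorted((x, b * x + rng.gauss(0.0, sigma)) for x in xs)
        cont.append(ols_slope([x for x, _ in data], [y for _, y in data]))
        half, third = n // 2, n // 3
        median.append(two_group_slope(data[:half], data[half:]))
        thirds.append(two_group_slope(data[:third], data[-third:]))
    v_cont = statistics.variance(cont)
    return (v_cont / statistics.variance(median),
            v_cont / statistics.variance(thirds))

median_ratio, thirds_ratio = simulate()
print(round(median_ratio, 2), round(thirds_ratio, 2))
```

Under these assumptions the two ratios should land near .636 and .79,
up to simulation noise.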

I've always thought a physicist considering the not uncommon practice in
the social sciences of doing median splits or using discrete cutoffs of
a continuous variable would think our practices are crazy and
unscientific.  Last fall I had the opportunity to observe a confirmation
of my hypothesis when a Nobel-prize winning physicist sat on the honors
committee of one of our psychology students.  She was studying reading
disability and, as is not uncommon in that field, defined the reading
disabled as those below the 10th percentile.  The physicist gently but
firmly pointed out that surely that was a bad idea and that it would
obviously be better to leave a continuous variable as a continuous
variable.

So, resist the temptation to split.  Leave your continuous variables
continuous.

Useful reading:

Irwin, J.R., & McClelland, G.H., Journal of Marketing Research, forthcoming

MacCallum, R.C., Zhang, S., Preacher, K.J., & Rucker, D.D. (2002).  On
      the practice of dichotomization of quantitative variables.
      Psychological Methods, 7(1), 19-40.

Both provide many of the earlier references.  I know of no published
article using statistical arguments to support splitting continuous
variables.

gary
[EMAIL PROTECTED]


