First, a comment about the recent conversation; then some new, related
stuff:

I agree with Donald Burrill; there is no assumption that histograms have
equal bar widths.

See (I think) Tukey in EDA. The example I remember is for looking at an
income distribution, where you might well have bins of, say, 0-5K, 5-10K,
10-25K, 25-50K, 50-100K, etc. This is an example where the natural categories,
the natural bins, are unequal, and it would make sense to create a display
that shows the categories _on a continuous axis_.

If I remember right, the official, orthodox view of histograms is that they
can have arbitrary bar widths, with DENSITY on the other axis. That is (as
Donald described), the AREA of a bar is proportional to frequency, not its
height. Furthermore, it's easy for people to interpret these charts informally
(the distributions look right). The hard part is answering detailed
questions such as, "how many people earn from $25-50K?" -- which might be
better answered by a table anyhow.
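
To make the area rule concrete, here is a minimal sketch in Python of how bar
heights could be computed for unequal bins. The function name, the bin edges
(in thousands of dollars), and the counts are all made up for illustration;
they are not from Tukey or from any real data set.

```python
def density_heights(edges, counts):
    """Return bar heights such that each bar's AREA equals its share of the total."""
    total = sum(counts)
    heights = []
    for left, right, count in zip(edges, edges[1:], counts):
        width = right - left
        # height = relative frequency / bin width, so height * width = relative frequency
        heights.append(count / total / width)
    return heights


# Hypothetical income bins (in $K) and invented counts, for illustration only.
edges = [0, 5, 10, 25, 50, 100]
counts = [10, 20, 30, 30, 10]
heights = density_heights(edges, counts)

# Recover the areas: each should equal that bin's fraction of the sample.
areas = [h * (r - l) for h, l, r in zip(heights, edges, edges[1:])]
```

In this invented example the 10-25K and 25-50K bins hold the same number of
people, but the wider bin gets the shorter bar. To answer "how many people earn
from $25-50K?" you have to multiply the height back by the width, which is why
a table may serve better for that kind of question.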

It is only in the special case -- where bar widths are equal -- that the
heights of the bars are proportional to frequency. Alas, since this "special
case" is so common, we get used to "histogram" == "a frequency chart, rather
like a bar chart but for continuous variables."

-----------------------
Now for the new part.

A common histogram -- equal bin width, different frequencies -- is one
special case of a histogram.

A few years ago, I helped implement another special case:

   equal frequencies, changing bin widths.

We called this an "Ntigram" (pronounced "EN-ti-gram"). Someone else must
have invented it too, but it has only been practical since microcomputers.
If you have ten bins of equal frequency, each represents a decile; with
four bins, a quartile; with N bins, an N-tile, whence "Ntigram."
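
Here is a rough sketch, in Python, of how equal-frequency bins like these could
be computed. This is NOT Fathom's implementation; it is just one way to do it,
using the standard library's statistics.quantiles for the cut points. The
function name and the sample data are made up, and the sketch assumes the cut
points come out distinct (heavily tied data would need extra handling).

```python
import statistics


def ntigram_bins(data, n_bins):
    """Return (edges, heights) for n_bins bins holding equal shares of the data."""
    values = sorted(data)
    # n_bins - 1 interior cut points; "inclusive" treats min and max as the ends.
    cuts = statistics.quantiles(values, n=n_bins, method="inclusive")
    edges = [values[0]] + cuts + [values[-1]]
    heights = []
    for left, right in zip(edges, edges[1:]):
        width = right - left
        if width <= 0:
            # Assumption of this sketch: no zero-width bins from tied values.
            raise ValueError("tied cut points; this sketch does not handle them")
        # Every bin has area 1/n_bins, so height is inversely proportional to width.
        heights.append(1 / n_bins / width)
    return edges, heights


# Invented, right-skewed sample: dense at the low end, sparse at the high end.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 20, 40, 60, 80, 100]
edges, heights = ntigram_bins(sample, 4)
```

With this made-up sample and four bins, the two bins at the dense low end come
out narrow and tall, and the two bins covering the sparse high end come out
wide and short.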

These graphs are pretty interesting, especially if you have more than about
10 cases in each bin. Here's why, I think: When you look at the distribution
of a population with a common histogram, you're always asking, is this
feature I see real?

Consider an age distribution of a sample from a community. It humps up in
the middle (baby boomers, college students) and trails off at the end. If
you bin by five years, up at the top end, you see a peak of 70-75 year olds
and a gap at 75-80. Is it real? If there are only five people in the "peak,"
probably not. So we want to smooth that out.

But when you smooth it out by increasing the bin size, you lose possibly
real structure in the more populous areas of the distribution.

With an Ntigram, by contrast, if you plot the 20-iles, you get skinny bins
(lots of structure) where you have lots of people, and wide bins (less
structure, more smoothing) where the population is low.

Anyhow, these graphs are part of Fathom (www.keypress.com/fathom). If you
want to play with them, download the demo. Load in some Census microdata,
make a graph, and display "age." You can change the default dot plot to a box
plot, a histogram, or an Ntigram; drag on the bar edges to change the
widths. (Windows only for now, but I'm playing with the nascent Mac version
and it's great.)


-- 
Tim Erickson * eeps media * [EMAIL PROTECTED]
5269 Miles Avenue, Oakland CA 94618 * 510.653.3377



