testing distributions with few events per bin

D. Wright Sun, 16 Jul 2000 11:40:14 -0700

I'm a newbie to the group but not to statistics.  Here is my problem: I
have a probability distribution over a discrete (or countably infinite,
in any case "naturally binned") space and I want to test whether a
sample agrees with the distribution when some bins have few events.

Here is an example.  The ranks of random 8 x 8 binary matrices (all
entries are 0 or 1) are distributed as follows:

  rank  prob.     sample (e.g. N=100)
  8     0.2899    28
  7     0.5776    54
  6     0.1273    15
  5     0.00512    2
  4     4.4e-05    0
  3     8.5e-08    1
  2     1.7e-11    0
  1     3.5e-15    0
  0     5.3e-20    0
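
For reference, the probabilities above can be reproduced from the standard
count of n x n matrices of rank r over a finite field GF(q).  The sketch
below assumes that "rank" means rank over GF(2) (arithmetic mod 2) and
that every entry is an independent fair coin flip; under that assumption
it gives, for example, 0.2899 for rank 8 and 0.5776 for rank 7.  The
function name is illustrative only.

```python
from fractions import Fraction


def rank_probabilities(n=8, q=2):
    """P(rank = r) for a uniformly random n x n matrix over GF(q).

    Uses the standard count of n x n matrices of rank r over GF(q):
        prod_{i=0}^{r-1} (q^n - q^i)^2 / (q^r - q^i)
    divided by the total number of matrices, q^(n*n).
    """
    total = Fraction(q) ** (n * n)
    probs = {}
    for r in range(n + 1):
        count = Fraction(1)  # empty product: only the zero matrix has rank 0
        for i in range(r):
            count *= Fraction((q**n - q**i) ** 2, q**r - q**i)
        probs[r] = count / total
    return probs


if __name__ == "__main__":
    for r, p in sorted(rank_probabilities().items(), reverse=True):
        print(f"{r}  {float(p):.4g}")
```

Exact rational arithmetic is used so that the tiny low-rank probabilities
are not lost to rounding; convert to float only for display.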

I want to test whether a sample of matrices is random at some given
confidence level.  Most matrices are nearly full-rank, so there will be
very few events in the low-rank bins.  Here is the progress of my
thought:

1) The traditional trick is to combine enough low-event bins together
that the expected number of events is 10 or so, and then do a chi2
test.  But this patently throws away information.  A rank-1 matrix event
is telling me much more than a single rank-3 matrix event, but this
technique gives them equal weight.  So I want a test that doesn't
require me to combine bins.

2) How about a KS test?  I tried this, but it came back with garbage
(told me that data which I know to be good, and which did fine in a chi2
test, was "too good to be true").  I believe one assumption in the KS
test is that the distribution being tested is continuous, i.e. not
discrete.  So this doesn't work.

3) So what do I do?  I would really like a test statistic that becomes
chi2 when bin populations are high, but can also handle low bin
populations.  The distribution of the statistic need not be universal: I
have analytic expressions for the probabilities of the individual bins
and am perfectly willing to plug these into some hellishly complex
formula for the distribution of a test statistic.  Does such a test
statistic exist?

4) I have played with this problem for a few days now.  Basically, I
think I am looking for a metric on the space of multinomial
distributions which would characterize the distance between two
distributions as something close to the ratio of their likelihoods.
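
To make points 1, 3 and 4 concrete, here is a minimal sketch, not a
settled answer.  It pools low-expectation bins and computes the Pearson
chi2 statistic as in point 1, and, as one candidate of the kind described
in points 3 and 4, computes the log-likelihood-ratio statistic
G = 2 * sum(O * ln(O / E)) on the unpooled bins.  G is asymptotically
chi2 when the counts are large; for small counts the sketch estimates its
null distribution by simulation instead of relying on that limit.  All
function names, the pooling rule, the number of simulations and the seed
are assumptions made for illustration.

```python
import math
import random


def pool_bins(expected, observed, min_expected=10.0):
    """Merge consecutive bins until each group has expected >= min_expected."""
    pooled, acc_e, acc_o = [], 0.0, 0
    for e, o in zip(expected, observed):
        acc_e += e
        acc_o += o
        if acc_e >= min_expected:
            pooled.append((acc_e, acc_o))
            acc_e, acc_o = 0.0, 0
    if acc_e > 0 or acc_o > 0:  # fold the leftover tail into the last group
        if pooled:
            e, o = pooled.pop()
            pooled.append((e + acc_e, o + acc_o))
        else:
            pooled.append((acc_e, acc_o))
    return pooled


def pearson_chi2(pooled):
    """Pearson statistic over (expected, observed) pairs."""
    return sum((o - e) ** 2 / e for e, o in pooled)


def g_statistic(observed, probs):
    """Log-likelihood-ratio statistic; empty bins contribute zero."""
    n = sum(observed)
    norm = sum(probs)  # guard against rounded probabilities
    return 2.0 * sum(
        o * math.log(o / (n * p / norm))
        for o, p in zip(observed, probs)
        if o > 0
    )


def monte_carlo_p_value(observed, probs, n_sim=2000, seed=0):
    """Estimate P(G >= G_observed) under the null by multinomial simulation."""
    rng = random.Random(seed)
    n = sum(observed)
    g_obs = g_statistic(observed, probs)
    bins = range(len(probs))
    hits = 0
    for _ in range(n_sim):
        counts = [0] * len(probs)
        for b in rng.choices(bins, weights=probs, k=n):
            counts[b] += 1
        if g_statistic(counts, probs) >= g_obs:
            hits += 1
    return (hits + 1) / (n_sim + 1)  # add-one estimate, never exactly zero


# Table from the example above, ranks 8 down to 0, blanks read as zero.
PROBS = [0.2899, 0.5776, 0.1273, 0.00512, 4.4e-05, 8.5e-08,
         1.7e-11, 3.5e-15, 5.3e-20]
SAMPLE = [28, 54, 15, 2, 0, 1, 0, 0, 0]

if __name__ == "__main__":
    n = sum(SAMPLE)
    pooled = pool_bins([n * p for p in PROBS], SAMPLE)
    print("pooled chi2:", pearson_chi2(pooled), "groups:", len(pooled))
    print("G:", g_statistic(SAMPLE, PROBS))
    print("Monte Carlo p:", monte_carlo_p_value(SAMPLE, PROBS))
```

On the example sample, pooling leaves three groups, so the single rank-3
event is counted the same as any other low-rank event, whereas in G it
contributes a term of order ln(1/E) for its own bin.  Whether G is the
best statistic for this purpose is exactly the open question.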

Can someone help me out here?  Is this a solved problem?  Can someone
give a hint or point me in the right direction?


=================================================================
Instructions for joining and leaving this list and remarks about
the problem of INAPPROPRIATE MESSAGES are available at
                  http://jse.stat.ncsu.edu/
=================================================================