Re: statistics question

Robert J. MacG. Dawson Tue, 16 Dec 2003 12:28:51 -0800


Allan Adler wrote:
> 
> Robert J. MacG. Dawson writes:
> 
> >       So, the question is - can we assume that the probability of an "error"
> >in the larger "dictionary" is independent of whether the "word" is
> >included in the smaller "dictionary"?  If so, this becomes a fairly
> >trivial exercise in binomial sampling.
> 
> Yes, you can assume that the probability of an error in the larger dictionary
> is independent of whether the word is included in the smaller dictionary.
> 
> How can I learn to do this trivial exercise?


        You are trying to determine the proportion P of incorrectly defined
"words" in the big "dictionary". 

        Choose "words" at random from the small "dictionary" and look them up
in the big one. Each lookup is a Bernoulli experiment with parameter P:
you have probability P of an "error" and (1-P) of no "error".  Doing it
N times is a Binom(N,P) experiment. You then use the Z interval estimate
to get a confidence interval for P (in any standard first-year textbook;
or see below).

        Ahead of time you will choose N to give a tight enough interval. The
general formula is: 

                95% CI = p +- 1.96 * sqrt(p(1-p)/N)  

where p is the _observed_ proportion of "errors". We don't know p ahead
of time, but we know that p(1-p) is never greater than 1/4, and is
rather close to 1/4 unless p<0.1 or p>0.9. Moreover, substituting 1/4 is
conservative: it can only make the planned interval wider than the one
you actually end up with.
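
        As a rough illustration of the formula above, here is a minimal Python
sketch. The function name wald_interval and the counts in the usage line
(150 "errors" in 1000 lookups) are made up for the example; they are not
from the original problem.

```python
import math

def wald_interval(errors, n, z=1.96):
    """Z (normal-approximation) interval for a proportion.

    errors: number of "errors" observed; n: number of "words" sampled.
    z = 1.96 corresponds to 95% confidence.
    Returns (p, lower, upper), with the bounds clipped to [0, 1].
    """
    if n <= 0 or not 0 <= errors <= n:
        raise ValueError("need n > 0 and 0 <= errors <= n")
    p = errors / n                                 # observed proportion
    half_width = z * math.sqrt(p * (1 - p) / n)    # 1.96 * sqrt(p(1-p)/N)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

# Hypothetical usage: 150 "errors" found in 1000 sampled "words".
p, lower, upper = wald_interval(150, 1000)
print(f"p = {p:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```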

        A sample of size 1000 will give you an interval of about plus-or-minus
3%, 19 times out of 20, if P is more than about 10% (and less than about
90%). 10,000 gives +- 1%, and so on.
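
        The planning arithmetic can be sketched the same way, assuming the
worst case p(1-p) = 1/4. The function names here are hypothetical, chosen
only for this example.

```python
import math

def worst_case_margin(n, z=1.96):
    """Largest possible half-width of the Z interval for sample size n,
    obtained by replacing p(1-p) with its maximum value 1/4."""
    return z * math.sqrt(0.25 / n)

def sample_size_for_margin(margin, z=1.96):
    """Smallest n whose worst-case half-width is at most `margin`
    (margin is a proportion, e.g. 0.03 for plus-or-minus 3%)."""
    return math.ceil(z * z * 0.25 / margin ** 2)

for n in (1000, 10000):
    print(n, round(worst_case_margin(n), 4))   # roughly 0.031 and 0.0098

print(sample_size_for_margin(0.03))            # a little over 1000
```

Because the half-width shrinks like 1/sqrt(N), cutting the margin by a
factor of 3 costs roughly 9 times the sample.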

        If P is smaller, the interval will be correspondingly narrower.
This breaks down if P is so small that NP is less than about 10; in that
case other methods, based on a Poisson distribution, can be used.
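
        One possible Poisson-based method (an assumption for illustration, not
necessarily the one meant above) treats the "error" count as Poisson and
finds an exact interval for its mean by bisection, then divides by N. All
function names are hypothetical, and the sketch is only meant for small
counts.

```python
import math

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), summed term by term."""
    if k < 0:
        return 0.0
    term = math.exp(-lam)
    total = term
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return total

def _solve_decreasing(f, target, lo, hi, iters=200):
    """Bisection for f(x) = target, where f decreases on [lo, hi]."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def poisson_count_interval(k, alpha=0.05):
    """Exact two-sided interval for a Poisson mean, given observed count k."""
    bracket = 2 * k + 50   # comfortably above the upper limit for small k
    upper = _solve_decreasing(lambda lam: poisson_cdf(k, lam),
                              alpha / 2, 0.0, bracket)
    if k == 0:
        lower = 0.0
    else:
        lower = _solve_decreasing(lambda lam: poisson_cdf(k - 1, lam),
                                  1 - alpha / 2, 0.0, bracket)
    return lower, upper

def small_p_interval(errors, n, alpha=0.05):
    """Interval for P when N*P is small: Poisson limits divided by N."""
    lower, upper = poisson_count_interval(errors, alpha)
    return lower / n, min(1.0, upper / n)

# Hypothetical usage: 3 "errors" in 1000 lookups.
print(small_p_interval(3, 1000))
```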

        The big question is: are your independence assumptions valid? And that
depends on where the data come from. What works for a dictionary (which,
as I've explained, is not a scenario where your assumption that the
small book is better holds water) may not work for DNA sequencing or
tables of integrals or phone books.

        -Robert Dawson
