Allan Adler wrote:
>
> Robert J. MacG. Dawson writes:
>
> > So, the question is - can we assume that the probability of an "error"
> >in the larger "dictionary" is independent of whether the "word" is
> >included in the smaller "dictionary"? If so, this becomes a fairly
> >trivial exercise in binomial sampling.
>
> Yes, you can assume that the probability of an error in the larger dictionary
> is independent of whether the word is included in the smaller dictionary.
>
> How can I learn to do this trivial exercise?
You are trying to determine the proportion P of incorrectly defined
"words" in the big "dictionary".
Choose "words" at random from the small "dictionary" and look them up
in the big one. Each time you do this is a Bernoulli experiment with
parameter P; you have probability P of an "error" and (1-P) of no
"error". Doing it N times is a Binom(N,P) experiment. You then use the
Z interval estimate to get a confidence interval for P (in any standard
first-year textbook, or see below).
Ahead of time you will choose N to give a tight enough interval. The
general formula is:
95% CI = p +- 1.96 * sqrt(p(1-p)/N)
where p is the _observed_ proportion of "errors". We don't know p ahead
of time, but we know that p(1-p) is never greater than 1/4 (its value
at p = 1/2), and that sqrt(p(1-p)) stays between 0.3 and 0.5 as long as
p is between 0.1 and 0.9. So planning with p(1-p) = 1/4 gives a
conservative (too wide, never too narrow) estimate of the interval.
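As a rough illustration of the formula above, here is a minimal Python sketch. The function name wald_interval and the counts (120 "errors" in 1000 lookups) are made up for the example and are not data from this thread.

```python
import math

def wald_interval(errors, n, z=1.96):
    """Z (normal-approximation) interval for a proportion.

    errors: number of "errors" observed, n: number of "words" looked up.
    z=1.96 corresponds to the 95% interval in the formula above.
    """
    p_hat = errors / n  # observed proportion p
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    # Clip to [0, 1], since a proportion cannot fall outside that range.
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)

# Hypothetical counts, for illustration only.
low, high = wald_interval(errors=120, n=1000)
print(f"95% CI for P: {low:.3f} to {high:.3f}")
```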
A sample of size 1000 will give you an interval of at most about
plus-or-minus 3%, 19 times out of 20, if P is more than about 10%;
10,000 gives +- 1%, and so on.
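The planning step can be sketched the same way, using the worst case p(1-p) = 1/4. The function names here are my own, and this is only one way to set it up.

```python
import math

def worst_case_half_width(n, z=1.96):
    """Largest possible half-width of the Z interval for sample size n,
    obtained by replacing p(1-p) with its maximum value 1/4."""
    return z * math.sqrt(0.25 / n)

def sample_size_for(half_width, z=1.96):
    """Smallest n whose worst-case half-width does not exceed half_width.
    Solves z * sqrt(0.25 / n) <= half_width for n."""
    return math.ceil((z / (2 * half_width)) ** 2)

for n in (1000, 10000):
    print(n, round(worst_case_half_width(n), 4))

print(sample_size_for(0.03))
```

Because this uses the worst case, the interval you actually get will be no wider than planned, and narrower if P turns out to be far from 1/2.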
If P is smaller, the interval width will be correspondingly smaller.
This breaks down if P is so small that NP is less than about 10; in
that case other methods, based on a Poisson distribution, can be used.
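One possible Poisson-based method (an assumption on my part, since no particular one is named above) is the "exact" interval: treat the error count k as Poisson with mean NP, find the two means at which the observed count sits at the 2.5% tail, and divide by N. A minimal sketch using only the standard library, with made-up function names:

```python
import math

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), summed term by term."""
    term = math.exp(-lam)
    total = term
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return total

def bisect_decreasing(f, lo, hi, iters=100):
    """Root of a decreasing function f on [lo, hi] by bisection."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid  # root lies to the right
        else:
            hi = mid
    return (lo + hi) / 2

def poisson_interval(k, n, alpha=0.05):
    """Interval for P from k errors in n lookups, Poisson approximation."""
    # Upper limit: the mean at which P(X <= k) drops to alpha/2.
    upper = bisect_decreasing(
        lambda lam: poisson_cdf(k, lam) - alpha / 2,
        0.0, k + 10 * math.sqrt(k + 1) + 10)
    # Lower limit: the mean at which P(X >= k) rises to alpha/2,
    # i.e. P(X <= k-1) = 1 - alpha/2. It is 0 when no errors are seen.
    if k == 0:
        lower = 0.0
    else:
        lower = bisect_decreasing(
            lambda lam: poisson_cdf(k - 1, lam) - (1 - alpha / 2),
            0.0, float(k))
    return lower / n, upper / n

# Hypothetical counts: 3 errors in 1000 lookups.
print(poisson_interval(3, 1000))
```

With no errors at all the upper limit is -ln(0.025)/N, roughly 3.7/N, which is the two-sided cousin of the familiar "rule of three".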
The big question is: are your independence assumptions valid? And that
depends on where the data come from. What works for a dictionary (which,
as I've explained, is not a scenario where your assumption that the
small book is better holds water) may not work for DNA sequencing or
tables of integrals or phone books.
-Robert Dawson