Let me give you a quick answer without puzzling much over what you have
done. Neither confidence intervals nor sample size in the traditional
sense has much to do with your problem, which is why I don't want to
cast it in your terms. The Poisson assumption has possibilities, but
the simplest thing is to use your probability estimate, p, of a mutation
in a string. Assuming the samples are independent, the probability of
observing at least one mutation in a sample of N strings is
P = 1-(1-p)^N; solving for N gives N = log(1-P)/log(1-p), rounded up.
I'd set P to 0.5 myself, but you can set it to 0.95 if you like.
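Concretely, inverting P = 1-(1-p)^N gives the minimum sample size in closed form. A minimal sketch in Python (the function name `min_obs` is my own, not from the thread):

```python
import math

def min_obs(p, cl=0.95):
    """Smallest N such that the chance of observing at least one event
    of per-trial probability p in N independent trials is >= cl:
        1 - (1-p)**N >= cl   <=>   N >= log(1-cl) / log(1-p)."""
    return math.ceil(math.log(1.0 - cl) / math.log(1.0 - p))

print(min_obs(1e-3, 0.95))  # -> 2995 samples for p = 0.001, 95% confidence
print(min_obs(0.5, 0.5))    # -> 1 (a single fair-coin flip gives a 50% chance)
```

Note that this exact geometric-quantile result replaces any need for simulation or fitted surfaces.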
Rich Strauss wrote:
> The biological problem involves observing mutations in strings of DNA
> nucleotides ("sites"). Say that we have a DNA sequence of known length
> (i.e., known number of sites). If we make certain assumptions about the
> mutation rates at individual sites (where a mutation is a change in the
> state of a nucleotide), we can estimate the probability of a mutation
> occurring within the sequence. The simplest assumption is that
> mutations are Poisson-distributed, with a constant mean mutation rate
> among sites, but we're also invoking other kinds of assumptions. Since
> mutations are rather rare, the probability estimate is going to be low
> (on the order of, say, 1e-2 to 1e-4). We then want to ask: how many
> samples would we have to examine to be, say, 95% confident of observing
> a mutation within the sequence? This seems to be a simple question
> about statistical power that would have wide application in numerous
> fields.
>
> Since we needed a quick answer last week and since I couldn't figure out
> how to solve the problem as stated and couldn't find reference to this
> problem in the library stacks or on the web, I took a brute-force
> randomization approach by varying the probability level, building up
> distributions of numbers of observations to first hit, and stashing the
> high-end percentiles of the distributions. It took me about 4 hrs to
> run the simulations, fit the prediction surfaces using the Jandel
> TableCurve 3D software, and hard-code the results. I've written a
> Matlab function to return the estimated minimum sample size, given the
> probability estimate and desired confidence level. I've pasted the
> function in below (to avoid sending an attachment), which includes
> within the documentation the simulation code used to generate the
> estimates. A colleague wrote a function to independently encode the
> randomization and spot-check the results, and it seems to work just fine.
>
> But obviously this is a fix, not a solution, and I'd really like to be
> able to solve the problem analytically. Any advice would be appreciated.
>
> Rich Strauss
>
> At 08:47 PM 11/30/2003, you wrote:
>
>> I suspect that you have overformulated the problem, and I doubt if any
>> answer you receive will help. Try describing the biological problem
>> you are working on -- give us some background.
>>
>> Rich Strauss wrote:
>>
>>> I have what seems to be a fairly simple and common problem: given an
>>> estimate of the probability of occurrence of an event and a desired
>>> level of confidence (CL), I'm interested in estimating the minimum
>>> sample size (Nmin) needed to be CL% confident of observing the event
>>> (based on the conventional "frequentist" concept of confidence).
>>> I've done some extensive simulations and so can predict what I need,
>>> down to probabilities of occurrence of 1e-5. However, I can't seem
>>> to match these to theory. I know how to estimate the Nmin needed to
>>> estimate a proportion within a specified tolerance, but can't see how
>>> to apply that to this problem. I've scanned quite a few stats texts
>>> and have done Google and Scirus searches but can't locate what I
>>> need. Can anyone suggest a lead or something to read?
>>> Rich Strauss
>>>
>>> =============================================================
>>> Dr. Richard E. Strauss (806) 742-2719
>>> Biological Sciences (806) 742-2963 Fax
>>> Texas Tech University [EMAIL PROTECTED]
>>> Lubbock, TX 79409-3131
>>> <http://www.biol.ttu.edu/Strauss/Strauss.html>
>>> =============================================================
>>> --
>>
>> Bob Wheeler --- http://www.bobwheeler.com/
>> ECHIP, Inc. ---
>> Randomness comes in bunches.
>
>
> % MinObsProb: Given the probability of occurrence of an event and a desired
> %   level of confidence CL, predicts the minimum number of observations
> %   needed to be CL% sure of observing the event. Not reliable for
> %   probabilities < 0.00001 or confidence levels < 90% or so.
> %
> % Usage: Nmin = MinObsProb(P,{CL})
> %
> %   P =    [r x c] matrix of probabilities.
> %   CL =   optional scalar confidence level, specified as proportion or
> %            percentage, or [r x c] matrix of confidence levels
> %            corresponding to P [default = 95%].
> % -----------------------------------------------------------------------
> %   Nmin = [r x c] matrix of corresponding minimum sample sizes.
> %
>
> % RE Strauss, 11/24/03
>
> % Minimum sample sizes were determined by simulation, using the
> % following code:
> %
> % P = [1e-5:0.5e-5:1e-4, 1e-4:0.5e-4:1e-3, 1e-3:0.5e-3:1e-2, 0.02:0.01:0.99];
> % nP = length(P);
> % CL = [90:99];
> % iter = 10000;
> % firstobs = zeros(iter,nP);
> % for ip = 1:nP
> % p = P(ip);
> % for it = 1:iter
> % nobs = 1;
> % while (rand > p)
> % nobs = nobs+1;
> % end;
> % firstobs(it,ip) = nobs;
> % end;
> % end;
> % Cfirstobs = prctile(firstobs,CL);
> % Cprctile = CL'*ones(1,nP);
> % CP = ones(length(CL),1)*P;
> % results = [CP(:),Cprctile(:),Cfirstobs(:)];
> %
> % Results were divided into five probability intervals (0.00001-0.0001,
> % 0.0001-0.001, 0.001-0.01, 0.01-0.1, and 0.1-1), and TableCurve-3D was
> % used to find best-fitting polynomial surfaces (based on log of min
> % sample size) separately for the five intervals. Adjusted R^2 values
> % for the five fittings were, respectively,
> % 0.9995, 0.9995, 0.9998, 0.9997, and 0.9849.
>
> function Nmin = MinObsProb(P,CL)
> if (nargin < 2) CL = []; end;
>
> if (isempty(CL)) CL = 0.95; end;
> if (any(CL < 1))
> CL = CL*100;
> end;
>
> if (isscalar(CL))
> CL = CL*ones(size(P));
> end;
>
> if (any(size(P)~=size(CL)))
> error(' MinObsProb: input matrices not conformable.');
> end;
>
> if (any(CL/100 < 0.90) | any(CL/100 > 0.995))
> disp(' MinObsProb warning: confidence interval out of range for reliable prediction.');
> end;
>
> [rr,cc] = size(P);
> Nmin = NaN*ones(rr,cc);
>
> for ir = 1:rr
> for ic = 1:cc
> p = P(ir,ic);
> Nm = NaN;
> if (p >= 0.00001 & p <= 0.0001)
> a = -146125.704; b = 4192.17425; c = 819.436587; d = 79.92447509;
> e = 3.890757731; f = 0.075625449; g = 8294.104302; h = -177.861938;
> i = 1.90689187; j = -0.010221226; k = 2.1913466e-05;
> x = p;
> y = CL(ir,ic);
> lnx = log(x);
> z = a+b*lnx+c*lnx^2+d*lnx^3+e*lnx^4+f*lnx^5+g*y+h*y^2+i*y^3+j*y^4+k*y^5;
> Nm = round(exp(z));
> elseif (p >= 0.0001 & p <= 0.001)
> a = -187732.435; b = 405.8070169; c = 100.7737401;
> d = 12.43646311; e = 0.764583332; f = 0.018733163;
> g = 10097.03515; h = -216.446877; i = 2.319722292;
> j = -0.012429418; k = 2.6637351e-05;
> x = p;
> y = CL(ir,ic);
> lnx = log(x);
> z = a+b*lnx+c*lnx^2+d*lnx^3+e*lnx^4+f*lnx^5+g*y+h*y^2+i*y^3+j*y^4+k*y^5;
> Nm = round(exp(z));
> elseif (p > 0.001 & p <= 0.01)
> a = 10119.43848; b = 243.3554215; c = 86.85019909;
> d = 15.34363369; e = 1.347520432; f = 0.04706927;
> g = -423.556653; h = 6.832655942; i = -0.04898875; j = 0.000131735;
> x = p;
> y = CL(ir,ic);
> lnx = log(x);
> z = a+b*lnx+c*lnx^2+d*lnx^3+e*lnx^4+f*lnx^5+g*y+h*y^2+i*y^3+j*y^4;
> Nm = round(exp(z));
> elseif (p > 0.01 & p <= 0.1)
> a = -144687.708; b = -52.1890074; c = -31.5578708;
> d = -9.58405821; e = -1.43428434; f = -0.084650104;
> g = 7758.436335; h = -166.435027; i = 1.785082047;
> j = -0.0095723345; k = 2.0531475e-05;
> x = p;
> y = CL(ir,ic);
> lnx = log(x);
> z = a+b*lnx+c*lnx^2+d*lnx^3+e*lnx^4+f*lnx^5+g*y+h*y^2+i*y^3+j*y^4+k*y^5;
> Nm = round(exp(z));
> elseif (p > 0.1 & p <= 0.999)
> a = 6794.986206; b = -19.6427878; c = 63.81369514;
> d = -123.008008; e = 118.2346839; f = -44.2835887;
> g = -293.51766; h = 4.757317924; i = -0.03427111; j = 9.26009e-05;
> x = p;
> y = CL(ir,ic);
> z = a+b*x+c*x^2+d*x^3+e*x^4+f*x^5+g*y+h*y^2+i*y^3+j*y^4;
> Nm = round(exp(z));
> elseif (p>0.999)
> Nm = 1;
> end;
> if (isfinite(Nm))
> Nm = max(Nm,1);
> end;
> Nmin(ir,ic) = Nm;
> end;
> end;
>
> return;
>
>
>
>
--
Bob Wheeler --- http://www.bobwheeler.com/
ECHIP, Inc. ---
Randomness comes in bunches.
=================================================================
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at:
. http://jse.stat.ncsu.edu/ .
=================================================================