Let me give you a quick answer without puzzling much over what you have
done. Neither confidence intervals nor sample size in the traditional
sense has much to do with your problem, which is why I don't want to
cast it in your terms. The Poisson assumption has possibilities, but
the simplest thing is to use your probability estimate, p, of a mutation
in a string. Assuming the samples are independent, the probability of
observing at least one mutation in a sample of N strings is
P = 1-(1-p)^N; solving for N gives N = log(1-P)/log(1-p), rounded up.
I'd set P to 0.5 myself, but you can set it to 0.95 if you like.
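Concretely, inverting P = 1-(1-p)^N gives the minimum sample size in closed form. A minimal sketch in Python (the function name `min_obs` is my own, not from the thread):

```python
import math

def min_obs(p, cl=0.95):
    """Smallest N such that the chance of observing at least one event
    of per-trial probability p in N independent trials is >= cl:
        1 - (1-p)**N >= cl   <=>   N >= log(1-cl) / log(1-p)."""
    return math.ceil(math.log(1.0 - cl) / math.log(1.0 - p))

print(min_obs(1e-3, 0.95))  # -> 2995 samples for p = 0.001, 95% confidence
print(min_obs(0.5, 0.5))    # -> 1 (a single fair-coin flip gives a 50% chance)
```

Note that this exact geometric-quantile result replaces any need for simulation or fitted surfaces.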
Rich Strauss wrote:
> The biological problem involves observing mutations in strings of DNA
> nucleotides ("sites"). Say that we have a DNA sequence of known length
> (i.e., known number of sites). If we make certain assumptions about the
> mutation rates at individual sites (where a mutation is a change in the
> state of a nucleotide), we can estimate the probability of a mutation
> occurring within the sequence. The simplest assumption is that
> mutations are Poisson-distributed, with a constant mean mutation rate
> among sites, but we're also invoking other kinds of assumptions. Since
> mutations are rather rare, the probability estimate is going to be low
> (on the order of, say, 1e-2 to 1e-4). We then want to ask: how many
> samples would we have to examine to be, say, 95% confident of observing
> a mutation within the sequence? This seems to be a simple question
> about statistical power that would have wide application in numerous
> fields.
>
> Since we needed a quick answer last week and since I couldn't figure out
> how to solve the problem as stated and couldn't find reference to this
> problem in the library stacks or on the web, I took a brute-force
> randomization approach by varying the probability level, building up
> distributions of numbers of observations to first hit, and stashing the
> high-end percentiles of the distributions. It took me about 4 hrs to
> run the simulations, fit the prediction surfaces using the Jandel
> TableCurve 3D software, and hard-code the results. I've written a
> Matlab function to return the estimated minimum sample size, given the
> probability estimate and desired confidence level. I've pasted the
> function in below (to avoid sending an attachment), which includes
> within the documentation the simulation code used to generate the
> estimates. A colleague wrote a function to independently encode the
> randomization and spot-check the results, and it seems to work just fine.
>
> But obviously this is a fix, not a solution, and I'd really like to be
> able to solve the problem analytically. Any advice would be appreciated.
>
> Rich Strauss
>
> At 08:47 PM 11/30/2003, you wrote:
>
>> I suspect that you have overformulated the problem, and I doubt if any
>> answer you receive will help. Try describing the biological problem
>> you are working on -- give us some background.
>>
>> Rich Strauss wrote:
>>
>>> I have what seems to be a fairly simple and common problem: given an
>>> estimate of the probability of occurrence of an event and a desired
>>> level of confidence (CL), I'm interested in estimating the minimum
>>> sample size (Nmin) needed to be CL% confident of observing the event
>>> (based on the conventional "frequentist" concept of confidence).
>>> I've done some extensive simulations and so can predict what I need,
>>> down to probabilities of occurrence of 1e-5. However, I can't seem
>>> to match these to theory. I know how to estimate the Nmin needed to
>>> estimate a proportion within a specified tolerance, but can't see how
>>> to apply that to this problem. I've scanned quite a few stats texts
>>> and have done Google and Scirus searches but can't locate what I
>>> need. Can anyone suggest a lead or something to read?
>>> Rich Strauss
>>>
>>> =============================================================
>>> Dr. Richard E. Strauss (806) 742-2719
>>> Biological Sciences (806) 742-2963 Fax
>>> Texas Tech University [EMAIL PROTECTED]
>>> Lubbock, TX 79409-3131
>>> <http://www.biol.ttu.edu/Strauss/Strauss.html>
>>> =============================================================
>>> --
>>
>> Bob Wheeler --- http://www.bobwheeler.com/
>> ECHIP, Inc. ---
>> Randomness comes in bunches.
>
>
> % MinObsProb: Given the probability of occurrence of an event and a desired
> %   level of confidence CL, predicts the minimum number of observations
> %   needed to be CL% sure of observing the event. Not reliable for
> %   probabilities < 0.00001 or confidence levels < 90% or so.
> %
> % Usage: Nmin = MinObsProb(P,{CL})
> %
> %   P =    [r x c] matrix of probabilities.
> %   CL =   optional scalar confidence level, specified as proportion or
> %            percentage, or [r x c] matrix of confidence levels
> %            corresponding to P [default = 95%].
> % -----------------------------------------------------------------------
> %   Nmin = [r x c] matrix of corresponding minimum sample sizes.
> %
>
> % RE Strauss, 11/24/03
>
> % Minimum sample sizes were determined by simulation, using the
> % following code:
> %
> % P = [1e-5:0.5e-5:1e-4, 1e-4:0.5e-4:1e-3, 1e-3:0.5e-3:1e-2, 0.02:0.01:0.99];
> % nP = length(P);
> % CL = [90:99];
> % iter = 10000;
> % firstobs = zeros(iter,nP);
> % for ip = 1:nP
> % p = P(ip);
> % for it = 1:iter
> % nobs = 1;
> % while (rand > p)
> % nobs = nobs+1;
> % end;
> % firstobs(it,ip) = nobs;
> % end;
> % end;
> % Cfirstobs = prctile(firstobs,CL);
> % Cprctile = CL'*ones(1,nP);
> % CP = ones(length(CL),1)*P;
> % results = [CP(:),Cprctile(:),Cfirstobs(:)];
> %
> % Results were divided into five probability intervals (0.00001-0.0001,
> % 0.0001-0.001, 0.001-0.01, 0.01-0.1, and 0.1-1), and TableCurve-3D was
> % used to find best-fitting polynomial surfaces (based on log of min
> % sample size) separately for the five intervals. Adjusted R^2 values
> % for the five fittings were, respectively,
> % 0.9995, 0.9995, 0.9998, 0.9997, and 0.9849.
>
> function Nmin = MinObsProb(P,CL)
> if (nargin < 2) CL = []; end;
>
> if (isempty(CL)) CL = 0.95; end;
> if (any(CL < 1))
> CL = CL*100;
> end;
>
> if (isscalar(CL))
> CL = CL*ones(size(P));
> end;
>
> if (any(size(P)~=size(CL)))
> error(' MinObsProb: input matrices not conformable.');
> end;
>
> if (any(CL/100 < 0.90) | any(CL/100 > 0.995))
> disp(' MinObsProb warning: confidence interval out of range for reliable prediction.');
> end;
>
> [rr,cc] = size(P);
> Nmin = NaN*ones(rr,cc);
>
> for ir = 1:rr
> for ic = 1:cc
> p = P(ir,ic);
> Nm = NaN;
> if (p >= 0.00001 & p <= 0.0001)
> a = -146125.704; b = 4192.17425; c = 819.436587; d = 79.92447509;
> e = 3.890757731; f = 0.075625449; g = 8294.104302; h = -177.861938;
> i = 1.90689187; j = -0.010221226; k = 2.1913466e-05;
> x = p;
> y = CL(ir,ic);
> lnx = log(x);
> z = a+b*lnx+c*lnx^2+d*lnx^3+e*lnx^4+f*lnx^5+g*y+h*y^2+i*y^3+j*y^4+k*y^5;
> Nm = round(exp(z));
> elseif (p >= 0.0001 & p <= 0.001)
> a = -187732.435; b = 405.8070169; c = 100.7737401;
> d = 12.43646311; e = 0.764583332; f = 0.018733163;
> g = 10097.03515; h = -216.446877; i = 2.319722292;
> j = -0.012429418; k = 2.6637351e-05;
> x = p;
> y = CL(ir,ic);
> lnx = log(x);
> z = a+b*lnx+c*lnx^2+d*lnx^3+e*lnx^4+f*lnx^5+g*y+h*y^2+i*y^3+j*y^4+k*y^5;
> Nm = round(exp(z));
> elseif (p > 0.001 & p <= 0.01)
> a = 10119.43848; b = 243.3554215; c = 86.85019909;
> d = 15.34363369; e = 1.347520432; f = 0.04706927;
> g = -423.556653; h = 6.832655942; i = -0.04898875; j = 0.000131735;
> x = p;
> y = CL(ir,ic);
> lnx = log(x);
> z = a+b*lnx+c*lnx^2+d*lnx^3+e*lnx^4+f*lnx^5+g*y+h*y^2+i*y^3+j*y^4;
> Nm = round(exp(z));
> elseif (p > 0.01 & p <= 0.1)
> a = -144687.708; b = -52.1890074; c = -31.5578708;
> d = -9.58405821; e = -1.43428434; f = -0.084650104;
> g = 7758.436335; h = -166.435027; i = 1.785082047;
> j = -0.0095723345; k = 2.0531475e-05;
> x = p;
> y = CL(ir,ic);
> lnx = log(x);
> z = a+b*lnx+c*lnx^2+d*lnx^3+e*lnx^4+f*lnx^5+g*y+h*y^2+i*y^3+j*y^4+k*y^5;
> Nm = round(exp(z));
> elseif (p > 0.1 & p <= 0.999)
> a = 6794.986206; b = -19.6427878; c = 63.81369514;
> d = -123.008008; e = 118.2346839; f = -44.2835887;
> g = -293.51766; h = 4.757317924; i = -0.03427111; j = 9.26009e-05;
> x = p;
> y = CL(ir,ic);
> z = a+b*x+c*x^2+d*x^3+e*x^4+f*x^5+g*y+h*y^2+i*y^3+j*y^4;
> Nm = round(exp(z));
> elseif (p>0.999)
> Nm = 1;
> end;
> if (isfinite(Nm))
> Nm = max(Nm,1);
> end;
> Nmin(ir,ic) = Nm;
> end;
> end;
>
> return;
>
>
>
>
--
Bob Wheeler --- http://www.bobwheeler.com/
ECHIP, Inc. ---
Randomness comes in bunches.
=================================================================
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at:
. http://jse.stat.ncsu.edu/ .
=================================================================