The biological problem involves observing mutations in strings of DNA nucleotides ("sites"). Say that we have a DNA sequence of known length (i.e., a known number of sites). If we make certain assumptions about the mutation rates at individual sites (where a mutation is a change in the state of a nucleotide), we can estimate the probability of a mutation occurring somewhere within the sequence. The simplest assumption is that mutation counts are Poisson-distributed, with a constant mean mutation rate across sites, though other assumptions come into play as well. Since mutations are rather rare, the probability estimate is going to be low (on the order of, say, 1e-2 to 1e-4). We then want to ask: how many samples would we have to examine to be, say, 95% confident of observing a mutation within the sequence? This seems to be a simple question about statistical power that would have wide application in numerous fields.
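For concreteness, a minimal sketch of that first step under the constant-rate Poisson assumption might look like the following (the per-site rate mu and sequence length nsites are made-up values for illustration, not estimates from our data):

  % Probability of at least one mutation in a sequence of nsites sites,
  % assuming the number of mutations is Poisson with constant per-site rate mu.
  mu = 5e-6;              % hypothetical per-site mutation rate
  nsites = 1000;          % hypothetical sequence length (number of sites)
  lambda = mu*nsites;     % expected number of mutations in the sequence
  p = 1 - exp(-lambda)    % P(at least one mutation) = 1 - P(no mutations)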

Since we needed a quick answer last week, and since I couldn't figure out how to solve the problem as stated or find any reference to it in the library stacks or on the web, I took a brute-force randomization approach: varying the probability level, building up distributions of the number of observations to first hit, and stashing the high-end percentiles of those distributions. It took me about 4 hrs to run the simulations, fit the prediction surfaces using the Jandel TableCurve 3D software, and hard-code the results. I've written a Matlab function that returns the estimated minimum sample size, given the probability estimate and desired confidence level. I've pasted the function below (to avoid sending an attachment); its documentation includes the simulation code used to generate the estimates. A colleague wrote a function that independently encodes the randomization to spot-check the results, and it seems to work just fine.
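A crude spot check along the same lines (not my colleague's actual code, just a sketch of the idea, with an arbitrary example probability) is to simulate many samples of the predicted size and see how often at least one event turns up:

  % Spot check (illustrative values only): for probability p, take the sample
  % size predicted by the function pasted below and estimate the chance that
  % a sample of that size contains at least one event.
  p = 0.01;                            % example event probability
  N = MinObsProb(p, 95);               % predicted minimum sample size
  iter = 10000;                        % number of simulated samples
  hits = sum(any(rand(N,iter) < p));   % samples containing at least one event
  coverage = hits/iter                 % should come out near (or a bit above) 0.95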

But obviously this is a fix, not a solution, and I'd really like to be able to solve the problem analytically. Any advice would be appreciated.

Rich Strauss

At 08:47 PM 11/30/2003, you wrote:
I suspect that you have over-formulated the problem, and I doubt that any answer you receive will help. Try describing the biological problem you are working on -- give us some background.

Rich Strauss wrote:
I have what seems to be a fairly simple and common problem: given an estimate of the probability of occurrence of an event and a desired level of confidence (CL), I'm interested in estimating the minimum sample size (Nmin) needed to be CL% confident of observing the event (based on the conventional "frequentist" concept of confidence). I've done some extensive simulations and so can predict what I need, down to probabilities of occurrence of 1e-5. However, I can't seem to match these to theory. I know how to estimate the Nmin needed to estimate a proportion within a specified tolerance, but I can't see how to apply that to this problem. I've scanned quite a few stats texts and have done Google and Scirus searches but can't locate what I need. Can anyone suggest a lead or something to read?
Rich Strauss


--
Bob Wheeler --- http://www.bobwheeler.com/
        ECHIP, Inc. ---
Randomness comes in bunches.

% MinObsProb: Given the probability of occurrence of an event and a desired level
% of confidence CL, predicts the minimum number of observations needed
% to be CL% sure of observing the event. Not reliable for probabilities
% < 0.00001 or for confidence levels outside roughly 90-99%.
%
% Usage: Nmin = MinObsProb(P,{CL})
%
% P = [r x c] matrix of probabilities.
% CL = optional scalar confidence level, specified as proportion or
% percentage, or [r x c] matrix of confidence levels corresponding
% to P [default = 95%].
% -----------------------------------------------------------------------
% Nmin = [r x c] matrix of corresponding minimum sample sizes.
%


% RE Strauss, 11/24/03

% Minimum sample sizes were determined by simulation, using the following code:
%
% P = [1e-5:0.5e-5:1e-4, 1e-4:0.5e-4:1e-3, 1e-3:0.5e-3:1e-2, 0.02:0.01:0.99];
% nP = length(P);
% CL = [90:99];
% iter = 10000;
% firstobs = zeros(iter,nP);
% for ip = 1:nP
%   p = P(ip);
%   for it = 1:iter
%     nobs = 1;
%     while (rand > p)
%       nobs = nobs+1;
%     end;
%     firstobs(it,ip) = nobs;
%   end;
% end;
% Cfirstobs = prctile(firstobs,CL);
% Cprctile = CL'*ones(1,nP);
% CP = ones(length(CL),1)*P;
% results = [CP(:),Cprctile(:),Cfirstobs(:)];
%
% Results were divided into five probability intervals (0.00001-0.0001, 0.0001-0.001,
% 0.001-0.01, 0.01-0.1, and 0.1-1), and TableCurve-3D was used to find best-fitting
% polynomial surfaces (based on the log of minimum sample size) separately for the five
% intervals. Adjusted R^2 values for the five fittings were, respectively,
% 0.9995, 0.9995, 0.9998, 0.9997, and 0.9849.


function Nmin = MinObsProb(P,CL)
  if (nargin < 2) CL = []; end;

  if (isempty(CL)) CL = 0.95; end;
  if (any(CL(:) < 1))                   % CL may be given as proportion or percentage
    CL = CL*100;
  end;

  if (isscalar(CL) & ismatrix(P))       % expand scalar CL to match P
    CL = CL*ones(size(P));
  end;

  if (~isequal(size(P),size(CL)))
    error('  MinObsProb: input matrices not conformable.');
  end;

  if (any(CL(:) < 90) | any(CL(:) > 99.5))
    disp('  MinObsProb warning: confidence level out of range for reliable prediction.');
  end;


  [rr,cc] = size(P);
  Nmin = NaN*ones(rr,cc);

  for ir = 1:rr
    for ic = 1:cc
      p = P(ir,ic);
      Nm = NaN;
      if (p >= 0.00001 & p <= 0.0001)       % fitted surface for 1e-5 <= p <= 1e-4
        a = -146125.704; b = 4192.17425; c = 819.436587; d = 79.92447509;
        e = 3.890757731; f = 0.075625449; g = 8294.104302; h = -177.861938;
        i = 1.90689187; j = -0.010221226; k = 2.1913466e-05;
        x = p;
        y = CL(ir,ic);
        lnx = log(x);
        z = a+b*lnx+c*lnx^2+d*lnx^3+e*lnx^4+f*lnx^5+g*y+h*y^2+i*y^3+j*y^4+k*y^5;
        Nm = round(exp(z));
      elseif (p >= 0.0001 & p <= 0.001)     % fitted surface for 1e-4 <= p <= 1e-3
        a = -187732.435; b = 405.8070169; c = 100.7737401;
        d = 12.43646311; e = 0.764583332; f = 0.018733163;
        g = 10097.03515; h = -216.446877; i = 2.319722292;
        j = -0.012429418; k = 2.6637351e-05;
        x = p;
        y = CL(ir,ic);
        lnx = log(x);
        z = a+b*lnx+c*lnx^2+d*lnx^3+e*lnx^4+f*lnx^5+g*y+h*y^2+i*y^3+j*y^4+k*y^5;
        Nm = round(exp(z));
      elseif (p > 0.001 & p <= 0.01)        % fitted surface for 1e-3 < p <= 1e-2
        a = 10119.43848; b = 243.3554215; c = 86.85019909;
        d = 15.34363369; e = 1.347520432; f = 0.04706927;
        g = -423.556653; h = 6.832655942; i = -0.04898875; j = 0.000131735;
        x = p;
        y = CL(ir,ic);
        lnx = log(x);
        z = a+b*lnx+c*lnx^2+d*lnx^3+e*lnx^4+f*lnx^5+g*y+h*y^2+i*y^3+j*y^4;
        Nm = round(exp(z));
      elseif (p > 0.01 & p <= 0.1)          % fitted surface for 1e-2 < p <= 0.1
        a = -144687.708; b = -52.1890074; c = -31.5578708;
        d = -9.58405821; e = -1.43428434; f = -0.084650104;
        g = 7758.436335; h = -166.435027; i = 1.785082047;
        j = -0.0095723345; k = 2.0531475e-05;
        x = p;
        y = CL(ir,ic);
        lnx = log(x);
        z = a+b*lnx+c*lnx^2+d*lnx^3+e*lnx^4+f*lnx^5+g*y+h*y^2+i*y^3+j*y^4+k*y^5;
        Nm = round(exp(z));
      elseif (p > 0.1 & p <= 0.999)         % fitted surface for 0.1 < p <= 0.999
        a = 6794.986206; b = -19.6427878; c = 63.81369514;
        d = -123.008008; e = 118.2346839; f = -44.2835887;
        g = -293.51766; h = 4.757317924; i = -0.03427111; j = 9.26009e-05;
        x = p;
        y = CL(ir,ic);
        z = a+b*x+c*x^2+d*x^3+e*x^4+f*x^5+g*y+h*y^2+i*y^3+j*y^4;
        Nm = round(exp(z));
      elseif (p > 0.999)                    % event essentially certain on first observation
        Nm = 1;
      end;
      if (isfinite(Nm))
        Nm = max(Nm,1);
      end;
      Nmin(ir,ic) = Nm;
    end;
  end;


return;
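For reference, calls would look something like this (input values are arbitrary examples):

  Nmin = MinObsProb(0.001)              % single probability, default 95% confidence level
  Nmin = MinObsProb([0.001 0.01], 99)   % matrix of probabilities, 99% confidence level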



=============================================================
Dr. Richard E. Strauss                  (806) 742-2719
Biological Sciences                     (806) 742-2963 Fax
Texas Tech University                   [EMAIL PROTECTED]
Lubbock, TX  79409-3131
<http://www.biol.ttu.edu/Strauss/Strauss.html>
=============================================================
