On 4 Nov 2003 at 19:26, Scott Edwards  Tallahassee,FL wrote:

> Thanks Kjetil,
> I hadn't thought of using simulation.  I've done a bit of discrete
> event simulation for operations research problems, but this looks
> quite a bit different.  To be honest, I haven't used R before, and I
> simply can't follow your code.  What assumptions does a simulation
> like this make? (similar to the ones statistical models have like
> independence, normal curve, homoscedasticity, etc)
> Scott
> 

Assumptions: This assumes a binomial model for each employee 
completing records, with the same probability for each record by the 
same employee, and conditional independence of the records given the 
probability for the employee. 
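In R terms (a minimal sketch; m and p as in the code quoted below): 
conditional on an employee's probability p, that employee's m records 
are m independent Bernoulli draws.

rbinom(m, 1, p)   # m conditionally independent records for one employee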

On the next level it assumes that the probability is not the same for 
different employees, but that it varies according to some 
distribution, which in the simulation was taken to be normal 
(normality here is not essential). For practical reasons I expressed 
the probability on the logit scale, mostly to avoid probabilities 
outside [0,1]. All of this amounts to modelling the situation with a 
glmm (generalized linear mixed model), which I then estimate with the 
approximate PQL method, implemented by glmmPQL in library MASS, 
which depends on lme in library nlme. (This method of estimation is 
akin to using iteratively reweighted least squares to estimate a 
usual logistic regression, except that the weighted least squares 
step is replaced by a linear mixed model.) 
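Written out, the two levels are logit(p_i) = a + epsilon_i with 
epsilon_i ~ N(0, sdepsilon^2), and Y_ij ~ Bernoulli(p_i). For 
reference, a vectorized version of the simulation quoted below 
(plogis is R's inverse logit; variable names as in that code):

epsilon <- rnorm(n, sd = sdepsilon)       # one random effect per employee
p <- plogis(a + epsilon)                  # employee-level probabilities
Y <- rbinom(n * m, 1, rep(p, each = m))   # m records per employee, in id order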

The crucial part of the modelling is the st.dev. of logit(p), 
sdepsilon in the code. If this is zero, we have the independence 
model, and as it increases, the dependence within each employee 
increases. It might be possible to translate this into some kind of 
intra-class correlation coefficient, but none of my references has 
this for discrete data. 
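One common approximation (take it as a sketch, since my references do 
not cover the discrete case) treats the logistic model as a 
latent-variable model with residual variance pi^2/3 on the logit 
scale, which gives

icc <- sdepsilon^2 / (sdepsilon^2 + pi^2/3)   # latent-scale ICC
icc   # about 0.012 when sdepsilon = 0.2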

Kjetil Halvorsen




> [EMAIL PROTECTED] wrote in message news:<[EMAIL PROTECTED]>...
> > On 29 Oct 2003 at 19:49, Scott Edwards  Tallahassee,FL wrote:
> > 
> > I would attack this kind of problem with simulation, much faster than 
> > trying to get at a formula. Your problem could be phrased as a 
> > multilevel problem, so you could find more help at the multilevel 
> > list:
> > [EMAIL PROTECTED]
> > 
> > I would approach the simulation using a binomial mixed model, with 
> > the individuals as groups, using R:
> > 
> > library(nlme)
> > library(MASS)
> > 
> > m <- 20             # records per person
> > n <- 100            # number of persons
> > a <- log(0.1/0.9)   # intercept on the logit scale, i.e. p = 0.1
> > sdepsilon <- 0.2    # standard deviation on the logit scale
> > Y <- numeric(n*m)   # binary observation vector
> > id <- factor(rep(1:n, rep(m, n)))   # employee identifier per record
> > for (i in 1:n) {    # simulation
> >     epsilon <- rnorm(1, sd = sdepsilon)            # employee effect
> >     p <- exp(a + epsilon)/(1 + exp(a + epsilon))   # inverse logit
> >     for (j in 1:m) {
> >         Y[(i-1)*m + j] <- rbinom(1, 1, p)
> >     }
> > }
> > mod1 <- glmmPQL(fixed = Y ~ 1, random = ~ 1 | id,
> >                 family = binomial, data = data.frame(Y, id))
> > intervals(mod1)
> > Approximate 95% confidence intervals
> > 
> >  Fixed effects:
> >                 lower      est.     upper
> > (Intercept) -2.326024 -2.180668 -2.035312   # this is the interval
> >                                             # for "a", which you are
> >                                             # interested in
> > attr(,"label")
> > [1] "Fixed effects:"
> > 
> >  Random Effects:
> >   Level: id 
> >                        lower       est.    upper
> > sd((Intercept)) 1.030459e-05 0.03553273 122.5254
> > 
> >  Within-group standard error:
> >     lower      est.     upper 
> > 0.9693968 0.9998783 1.0313182 
> > 
> > 
> > Note that the interval is on the logit scale; you must transform it 
> > to get it on the probability scale.
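> > 
> > For example, applying plogis (R's inverse logit) to the endpoints 
> > above (a sketch; values are approximate):
> > 
> > plogis(c(-2.326024, -2.180668, -2.035312))
> > # 0.0890 0.1016 0.1156   on the probability scale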
> > 
> > Hope this helps, 
> > 
> > Kjetil Halvorsen
> > 
> > 
> > > Rich,
> > > Thanks for your response.  See comments/questions interspersed below.
> > > Scott
> > > 
> > > 
> > > Rich Ulrich <[EMAIL PROTECTED]> wrote in message news:<[EMAIL PROTECTED]>...
> > > > On 28 Oct 2003 17:55:10 -0800, [EMAIL PROTECTED] (Scott
> > > > Edwards  Tallahassee,FL) wrote:
> > > > 
> > > > > Does anyone know of a sampling formula to use for a proportion when
> > > > > you have 'clusters' of dependent cases in the population?
> > > > > I have to calculate a proportion of certain criteria that are met upon
> > > > > inspection of records in an agency.  The problem is that a single
> > > > > employee may complete 10-20 records, therefore the assumption of
> > > > > independence of cases is blown.  Therefore, I can't use the old
> > > > > tried-and-true sampling formula:
> > > > > 
> > > > >     n = (t^2 P Q / d^2) / ( 1 + (1/N) * (t^2 P Q / d^2 - 1) )
> > > > > 
> > > > >     where  n = sample size
> > > > >            t = t-value for desired confidence level
> > > > >                (e.g. 1.96 for 95%)
> > > > >            P = proportion with trait in population
> > > > >                (if unknown, 50% is most conservative)
> > > > >            Q = 1 - P
> > > > >            d = desired precision (confidence interval = proportion +- d)
> > > > >            N = population size
> > > > > 
> > > > > which is used for, like, political polls where every case is an
> > > > > independent person.
> > > > > What do I do?
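> > > > > 
> > > > > A minimal R sketch of that formula, assuming independent cases
> > > > > (the function name n_srs and the example N are illustrative only):
> > > > > 
> > > > > n_srs <- function(N, P = 0.5, d = 0.05, t = 1.96) {
> > > > >     n0 <- t^2 * P * (1 - P) / d^2   # infinite-population sample size
> > > > >     n0 / (1 + (n0 - 1) / N)         # finite population correction
> > > > > }
> > > > > n_srs(N = 2000)   # about 322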
> > > > 
> > > > You are asking for n, for the planning of a survey among N, 
> > > > and your formula is using Finite population correction.
> > > > You can check with groups.google and see how often 
> > > > I have told  people that FPC  is *usually*  a bad idea.   
> > > 
> > >   I was unaware of this. I will check your former messages.
> > > This appeared to be the 'standard' formula for sample size calculation
> > > when you are interested in a proportion of items that pass/fail, *and
> > > you have independence* (e.g. political polls), so I'm afraid that many
> > > of us are making this error. However, I am definitely not tied to this
> > > formula and am just looking for a method to get the job done as
> > > accurately as possible.
> > >   
> > > > However, from what you say, I can see that you *might*  
> > > > have an application that calls for it.  On the other hand,
> > > > if you are trying to meet the requirement of state or
> > > > federal regulations, the procedures are probably 
> > > > spelled out in detail.  If you are trying to create methods
> > > > for a regulatory system, then you need more consultation
> > > > than you can get  for free by e-mail.
> > > 
> > >   Actually, I was attempting to state why I *couldn't* use this
> > > formula, since it assumes independence, of which I must be very
> > > suspicious.
> > > Regarding your comment on regulation, this is simply a data analysis
> > > problem - I'm not clear why the issue of regulation is relevant.  If
> > > the methodology had been laid out in a regulation then I definitely
> > > wouldn't be wasting you guys' time asking for help in formulating one. 
> > > The problem from a research design/analysis standpoint doesn't strike
> > > me as *that* unusual.  I've read many times of how analyses must be
> > > adjusted due to 'clusters' of data points that are not independent
> > > (e.g. the effect of temperature on athletes' performance measured
> > > multiple times), I just haven't seen how to approach it from a sample
> > > size determination perspective.  Actually, it occurs to me that
> > > perhaps I should be conceptualizing the design as one of
> > > repeated-measures (each employee being a subject, with each record
> > > being evaluated being a point of measurement) - would that perhaps
> > > clarify how to determine the sample size?  The problem of course is
> > > that I must end up with a point estimate and confidence interval for
> > > the entire _agency_, rather than looking at the difference between
> > > subjects, as in a typical repeated measures experiment.  Still seems
> > > like it could be done.
> > > 
> > >   With regard to free vs. paid help, I certainly have no objection
> > > to people being paid for their time, and would happily pursue that
> > > path if I had sufficient resources to pay someone
> > > $100/hour.  However, I posted it here for two additional important
> > > reasons.
> > > 
> > >     1.  I wouldn't be sure who to approach to pay - I've tried all my
> > > stat friends and colleagues to no avail - and by posting it here I
> > > thought that I would approach the largest possible audience.
> > > 
> > >     2.  I was under the impression that this group was for the purpose
> > > of discussing interesting statistical issues/problems that had
> > > applicability beyond the specific problem.  Perhaps I'm missing
> > > something, but I'm unable to see your perspective that this problem
> > > would only come up in the context of 'regulation' - to the contrary,
> > > it seems to me it would come up in many instances of evaluating
> > > organizations as a whole, with many individuals performing multiple
> > > tasks (e.g. a factory with many employees, each making many widgets,
> > > where you want to estimate the *factory-wide* proportion of defects
> > > in the widgets - this is the *exact* same problem that I have)
> > > 
> > >  
> > > > The first thing I would do is check whether there 
> > > > actually is *dependence*.  
> > > > You can't assume independence,
> > > > but you may be able to demonstrate it.
> > > 
> > > Unfortunately, I don't have a sample of the data to estimate the
> > > degree of dependence.  Plus, if I wanted to go out and get such a
> > > sample, I'd have to bug you guys for help on how to determine *that*
> > > sample size.  :)
> > > 
> > > > If the people's responses are not independent, that 
> > > > changes the *sort* of statement that you should make.
> > > >  - Is there a correlation of 'p' with N?  
> > > 
> > > Do you mean lower case 'p' (i.e. p-value) or the P in the formula
> > > above?
> > > 
> > > >  - Is this <something> bad, or neutral -- that is, do you *have*  to
> > > > place a limit on it, or should you seek a neutral statement 
> > > > of what exists, which gives the best feel for the distribution?
> > > > (I am asserting that the mean and CI  is apt to be a thin 
> > > > statement, if you are looking for understanding.)
> > > 
> > > Not sure if I understand this part completely, but I definitely am
> > > looking for a neutral, unbiased statement of my best guess of what
> > > *exists* (with a confidence level and interval of course)
> > > 
> > > > Hope this helps.
> > > 
> > > Thanks for your time,
> > > 
> > > Scott Edwards