Thanks Kjetil,
I hadn't thought of using simulation. I've done a bit of discrete
event simulation for operations research problems, but this looks
quite a bit different. To be honest, I haven't used R before, and I
simply can't follow your code. What assumptions does a simulation like
this make (similar to the assumptions that statistical models carry,
such as independence, normality, homoscedasticity, etc.)?
Scott
[EMAIL PROTECTED] wrote in message news:<[EMAIL PROTECTED]>...
> On 29 Oct 2003 at 19:49, Scott Edwards Tallahassee,FL wrote:
>
> I would attack this kind of problem with simulation; that is much
> faster than trying to get at a formula. Your problem could be phrased
> as a multilevel problem, so you could find more help on the multilevel
> list:
> [EMAIL PROTECTED]
>
> I would approach the simulation using a binomial mixed model, with the
> individuals as the groups, using R:
>
> library(nlme)
> library(MASS)
> 
> m <- 20                  # records per person
> n <- 100                 # number of persons
> a <- log(0.1/0.9)        # logit-scale intercept giving p = 0.1
> sdepsilon <- 0.2         # between-person standard deviation on the logit scale
> Y <- numeric(n*m)        # binary observation vector
> id <- factor(rep(1:n, rep(m, n)))   # person (group) identifier
> for (i in 1:n) {         # simulate m records for each person
>   epsilon <- rnorm(1, sd = sdepsilon)
>   p <- exp(a + epsilon)/(1 + exp(a + epsilon))
>   for (j in 1:m) {
>     Y[(i-1)*m + j] <- rbinom(1, 1, p)
>   }
> }
> dat <- data.frame(Y = Y, id = id)
> mod1 <- glmmPQL(fixed = Y ~ 1, random = ~ 1 | id, family = binomial,
>                 data = dat)
> intervals(mod1)
> Approximate 95% confidence intervals
>
> Fixed effects:
> lower est. upper
> (Intercept) -2.326024 -2.180668 -2.035312   # this is the interval
>                                             # for "a" which you are
>                                             # interested in
> attr(,"label")
> [1] "Fixed effects:"
>
> Random Effects:
> Level: id
> lower est. upper
> sd((Intercept)) 1.030459e-05 0.03553273 122.5254
>
> Within-group standard error:
> lower est. upper
> 0.9693968 0.9998783 1.0313182
>
>
> Note that the interval is on the logit scale; you must transform it to
> get it onto the probability scale.
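> 
> For example, a quick sketch of the back-transformation, using the
> fixed-effect interval printed above (plogis() is the inverse logit in
> base R):
> 
> ci.logit <- c(lower = -2.326024, est = -2.180668, upper = -2.035312)
> plogis(ci.logit)   # interval endpoints on the probability scale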
>
> Hope this helps,
>
> Kjetil Halvorsen
>
>
> > Rich,
> > Thanks for your response. See comments/questions interspersed below.
> > Scott
> >
> >
> > Rich Ulrich <[EMAIL PROTECTED]> wrote in message news:<[EMAIL PROTECTED]>...
> > > On 28 Oct 2003 17:55:10 -0800, [EMAIL PROTECTED] (Scott
> > > Edwards Tallahassee,FL) wrote:
> > >
> > > > Does anyone know of a sampling formula to use for a proportion when
> > > > you have 'clusters' of dependent cases in the population?
> > > > I have to estimate the proportion of records in an agency that meet
> > > > certain criteria upon inspection. The problem is that a single
> > > > employee may complete 10-20 records, so the assumption of
> > > > independence of cases is blown. Therefore, I can't use the old
> > > > tried-and-true sampling formula:
> > > >
> > > >     n = (t^2 * P * Q / d^2) / ( 1 + (1/N) * (t^2 * P * Q / d^2 - 1) )
> > > > 
> > > >     where  n = sample size
> > > >            t = t-value for the desired confidence level (e.g. 1.96 for 95%)
> > > >            P = proportion with the trait in the population
> > > >                (if unknown, 50% is most conservative)
> > > >            Q = 1 - P
> > > >            d = desired precision (confidence interval = proportion +- d)
> > > >            N = population size
> > > >
> > > > which is used for things like political polls, where every case is
> > > > an independent person.
> > > > What do I do?
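> > > > 
> > > > (Just to make the formula concrete with illustrative numbers: taking
> > > > t = 1.96, P = Q = 0.5, d = 0.05 and N = 2000 gives
> > > > t^2*P*Q/d^2 = 384.16, so n = 384.16 / (1 + 383.16/2000)
> > > > = 384.16 / 1.192 = roughly 322, or 323 after rounding up.)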
> > >
> > > You are asking for n, for the planning of a survey among N,
> > > and your formula is using Finite population correction.
> > > You can check with groups.google and see how often
> > > I have told people that FPC is *usually* a bad idea.
> >
> > I was unaware of this. I will check your former messages.
> > This appeared to be the 'standard' formula for sample size calculation
> > when you are interested in a proportion of items that pass/fail, *and
> > you have independence* (e.g. political polls), so I'm afraid that many
> > of us are making this error. However, I am definitely not tied to this
> > formula and am just looking for a method to get the job done as
> > accurately as possible.
> >
> > > However, from what you say, I can see that you *might*
> > > have an application that calls for it. On the other hand,
> > > if you are trying to meet the requirement of state or
> > > federal regulations, the procedures are probably
> > > spelled out in detail. If you are trying to create methods
> > > for a regulatory system, then you need more consultation
> > > than you can get for free by e-mail.
> >
> > Actually, I was attempting to state why I *couldn't* use this
> > formula, since it assumes independence, of which I must be very
> > suspicious.
> > Regarding your comment on regulation, this is simply a data analysis
> > problem - I'm not clear why the issue of regulation is relevant. If
> > the methodology had been laid out in a regulation, then I definitely
> > wouldn't be wasting you guys' time asking for help in formulating one.
> > The problem from a research design/analysis standpoint doesn't strike
> > me as *that* unusual. I've read many times about how analyses must be
> > adjusted for 'clusters' of data points that are not independent
> > (e.g., the effect of temperature on the performance of athletes
> > measured multiple times); I just haven't seen how to approach it from
> > a sample size determination perspective. Actually, it occurs to me
> > that perhaps I should be conceptualizing the design as a
> > repeated-measures one (each employee being a subject, with each
> > record being evaluated as a point of measurement) - would that
> > perhaps clarify how to determine the sample size? The problem, of
> > course, is
> > that I must end up with a point estimate and confidence interval for
> > the entire _agency_, rather than looking at the difference between
> > subjects, as in a typical repeated measures experiment. Still seems
> > like it could be done.
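> > 
> > (One idea I have run across, though I am not sure how cleanly it
> > applies here, is the 'design effect' from survey sampling: the sample
> > size from the simple-random-sampling formula gets inflated by roughly
> > 1 + (m - 1)*rho, where m is the number of records per employee and
> > rho is the intraclass correlation among records completed by the same
> > employee. The catch, of course, is that I have no estimate of rho.)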
> >
> > With regard to free vs. paid help, I certainly have no objection to
> > people being paid for their time, and I would have no objection to
> > pursuing that path if I had sufficient resources to pay someone
> > $100/hour. However, I posted here for two additional important
> > reasons.
> >
> > 1. I wouldn't be sure who to approach to pay - I've tried all my
> > stat friends and colleagues to no avail - and by posting it here I
> > thought that I would approach the largest possible audience.
> >
> > 2. I was under the impression that this group was for the purpose
> > of discussing interesting statistical issues/problems that have
> > applicability beyond the specific problem at hand. Perhaps I'm missing
> > something, but I'm unable to see your perspective that this problem
> > would only come up in the context of 'regulation'. On the contrary,
> > it seems to me it would come up in many instances of evaluating an
> > organization as a whole, with many individuals performing multiple
> > tasks (e.g., a factory with many employees, each making many widgets,
> > where you want to estimate the *factory-wide* proportion of defective
> > widgets -- this is the *exact* same problem that I have).
> >
> >
> > > The first thing I would do is check whether there
> > > actually is *dependence*.
> > > You can't assume independence,
> > > but you may be able to demonstrate it.
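> > > For instance (one simple check among several; the counts below are
> > > made-up placeholders), tabulate each employee's pass rate from
> > > whatever records you can get and test homogeneity, e.g. in R:
> > > 
> > > passes <- c(8, 9, 4, 10, 7)      # hypothetical passes per employee
> > > totals <- c(10, 10, 10, 10, 10)  # hypothetical records per employee
> > > prop.test(passes, totals)        # chi-squared test of equal proportions
> > > 
> > > Strong heterogeneity across employees would be evidence of
> > > dependence within employees.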
> >
> > Unfortunately, I don't have a sample of the data to estimate the
> > degree of dependence. Plus, if I wanted to go out and get such a
> > sample, I'd have to bug you guys for help on how to determine *that*
> > sample size. :)
> >
> > > If the people's responses are not independent, that
> > > changes the *sort* of statement that you should make.
> > > - Is there a correlation of 'p' with N?
> >
> > Do you mean lower case 'p' (i.e. p-value) or the P in the formula
> > above?
> >
> > > - Is this <something> bad, or neutral -- that is, do you *have* to
> > > place a limit on it, or should you seek a neutral statement
> > > of what exists, which gives the best feel for the distribution?
> > > (I am asserting that the mean and CI are apt to be a thin
> > > statement, if you are looking for understanding.)
> >
> > Not sure if I understand this part completely, but I definitely am
> > looking for a neutral, unbiased statement of my best guess of what
> > *exists* (with a confidence level and interval of course)
> >
> > > Hope this helps.
> >
> > Thanks for your time,
> >
> > Scott Edwards
.
.
=================================================================
Instructions for joining and leaving this list, remarks about the
problem of INAPPROPRIATE MESSAGES, and archives are available at:
. http://jse.stat.ncsu.edu/ .
=================================================================