Re: sampling formula for proportion without independence?

kjetil Thu, 30 Oct 2003 13:06:07 -0800

On 29 Oct 2003 at 19:49, Scott Edwards  Tallahassee,FL wrote:

I would attack this kind of problem with simulation, which is much faster 
than trying to derive a formula. Your problem could be phrased as a 
multilevel problem, so you could find more help on the multilevel 
list:
[EMAIL PROTECTED]
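
For the planning question itself (how many persons to sample), a minimal 
sketch of the simulation idea in base R might look like the following. 
The function name sim_halfwidth, its arguments, and the default values 
(p = 0.1, a standard deviation of 0.2 on the logit scale, 20 records per 
person) are illustrative assumptions, not figures from the original 
question. It also assumes a normal person effect on the logit scale and 
ignores any finite population correction.

```r
# Sketch only: names and default values are illustrative assumptions.
# Estimates the approximate 95% half-width of the overall proportion
# when n persons are sampled and m records are inspected per person.
sim_halfwidth <- function(n, m, p = 0.1, sd_logit = 0.2, reps = 2000) {
  a <- qlogis(p)                                  # intercept on the logit scale
  phat <- replicate(reps, {
    p_i <- plogis(a + rnorm(n, sd = sd_logit))    # one probability per person
    y_i <- rbinom(n, size = m, prob = p_i)        # successes among m records each
    sum(y_i) / (n * m)                            # overall sample proportion
  })
  1.96 * sd(phat)                                 # spread of the estimate over replicates
}

set.seed(1)
n_persons <- c(25, 50, 100, 200)
hw <- sapply(n_persons, sim_halfwidth, m = 20)
print(data.frame(n_persons = n_persons, halfwidth = hw))
```

One would then pick the smallest number of persons whose half-width is 
at or below the desired precision, and repeat with a larger sd_logit to 
see how sensitive the answer is to the assumed dependence.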


I would approach the simulation using a binomial mixed model, with the 
individuals as groups, using R:

> library(MASS)      # provides glmmPQL
> library(nlme)      # provides intervals()
> m <- 20    # records per person
> n <- 100   # number of persons
> a <- log(0.1/0.9) # logit model for p=0.1
> sdepsilon <- 0.2   # standard deviation on logit scale
> Y <- numeric(n*m)  # binary observation vector
> id <- factor(rep(1:n, rep(m,n)))  # group identification
> for (i in 1:n) {   # simulation
+     epsilon <- rnorm(1, sd=sdepsilon)
+     p <- exp(a + epsilon)/(1 + exp(a + epsilon))
+     for (j in 1:m) {
+         Y[(i-1)*m + j] <- rbinom(1, 1, p)
+     }
+ }
> mod1 <- glmmPQL(fixed = Y ~ 1, random = ~ 1 | id, family = binomial)
> intervals(mod1)
Approximate 95% confidence intervals

 Fixed effects:
                lower      est.     upper
(Intercept) -2.326024 -2.180668 -2.035312   # this is the interval for "a",
                                            # which you are interested in
attr(,"label")
[1] "Fixed effects:"

 Random Effects:
  Level: id 
                       lower       est.    upper
sd((Intercept)) 1.030459e-05 0.03553273 122.5254

 Within-group standard error:
    lower      est.     upper 
0.9693968 0.9998783 1.0313182 


Note that the interval is on the logit scale; you must transform it to 
get it on the probability scale.
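
As a small sketch of that transformation, the inverse logit can be 
applied to the three numbers printed for the intercept above. The 
variable names ci_logit and ci_prob are just illustrative. Strictly, the 
result describes a person with a random effect of zero, but with a 
random-effect standard deviation this small it should be close to the 
overall proportion.

```r
# Interval for the intercept "a", copied from the intervals(mod1) output above
ci_logit <- c(lower = -2.326024, est = -2.180668, upper = -2.035312)

# plogis() is the inverse logit: exp(x) / (1 + exp(x))
ci_prob <- plogis(ci_logit)
print(ci_prob)
```

This gives an interval of roughly 0.09 to 0.12 around an estimate near 
0.10, consistent with the p = 0.1 used in the simulation.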

Hope this helps, 

Kjetil Halvorsen


> Rich,
> Thanks for your response.  See comments/questions interspersed below.
> Scott
> 
> 
> Rich Ulrich <[EMAIL PROTECTED]> wrote in message news:<[EMAIL PROTECTED]>...
> > On 28 Oct 2003 17:55:10 -0800, [EMAIL PROTECTED] (Scott
> > Edwards  Tallahassee,FL) wrote:
> > 
> > > Does anyone know of a sampling formula to use for a proportion when
> > > you have 'clusters' of dependent cases in the population?
> > > I have to calculate a proportion of certain criteria that are met upon
> > > inspection of records in an agency.  The problem is that a single
> > > employee may complete 10-20 records, therefore the assumption of
> > > independence of cases is blown.  Therefore, I can't use the old
> > > tried-and-true sampling formula:
> > > 
> > >           2 
> > >          t PQ                   n=sample size
> > >          ----                   t=t-value for desired confidence level
> > >            2                      (e.g. 1.96 for 95%)
> > >           d                     P=proportion with trait in population
> > > n=  --------------------         (if unknown, 50% most conservative)
> > >                 2               Q=1-P
> > >         1      t PQ             d=desired precision
> > >     1 + -- * ( -----  -  1 )     (confidence int = proportion +-d)
> > >         N        2              N=population size
> > >                 d
> > > 
> > > which is used for, like, political polls where every case is an
> > > independent person.
> > > What do I do?
> > 
> > You are asking for n, for the planning of a survey among N, 
> > and your formula is using Finite population correction.
> > You can check with groups.google and see how often 
> > I have told  people that FPC  is *usually*  a bad idea.   
> 
>   I was unaware of this. I will check your former messages.
> This appeared to be the 'standard' formula for sample size calculation
> when you are interested in a proportion of items that pass/fail, *and
> you have independence* (e.g. political polls), so I'm afraid that many
> of us are making this error. However, I am definitely not tied to this
> formula and am just looking for a method to get the job done as
> accurately as possible.
>   
> > However, from what you say, I can see that you *might*  
> > have an application that calls for it.  On the other hand,
> > if you are trying to meet the requirement of state or
> > federal regulations, the procedures are probably 
> > spelled out in detail.  If you are trying to create methods
> > for a regulatory system, then you need more consultation
> > than you can get  for free by e-mail.
> 
>   Actually, I was attempting to state why I *couldn't* use this
> formula, since it assumes independence, which I must be very suspect
> of.
> Regarding your comment on regulation, this is simply a data analysis
> problem - I'm not clear why the issue of regulation is relevant.  If
> the methodology had been laid out in a regulation then I definitely
> wouldn't be wasting you guys' time asking for help in formulating one. 
> The problem from a research design/analysis standpoint doesn't strike
> me as *that* unusual.  I've read many times of how analyses must be
> adjusted due to 'clusters' of data points that are not independent
> (e.g. effect of temperature on performance of athletes measured
> multiple times); I just haven't seen how to approach it from a sample
> size determination perspective.  Actually, it occurs to me that
> perhaps I should be conceptualizing the design as one of
> repeated-measures (each employee being a subject, with each record
> being evaluated being a point of measurement) - would that perhaps
> clarify how to determine the sample size?  The problem of course is
> that I must end up with a point estimate and confidence interval for
> the entire _agency_, rather than looking at the difference between
> subjects, as in a typical repeated measures experiment.  Still seems
> like it could be done.
> 
>   With regard to free vs. paid help, I certainly have no objection to
> people being paid for their time, and would have no objection to
> pursue that path if I had sufficient resources to pay someone
> $100/hour.  However, I posted it here for two additional important
> reasons.
> 
>     1.  I wouldn't be sure who to approach to pay - I've tried all my
> stat friends and colleagues to no avail - and by posting it here I
> thought that I would approach the largest possible audience.
> 
>     2.  I was under the impression that this group was for the purpose
> of discussing interesting statistical issues/problems that had
> applicability beyond the specific problem.  Perhaps I'm missing
> something, but I'm unable to see your perspective that this problem
> would only come up in the context of 'regulation' - to the contrary,
> it seems to me it would come up in many instances of evaluating
> organizations as a whole, with many individuals, performing multiple
> tasks (e.g. a factory, with many employees, making many widgets each
> and you wanted to estimate *factory-wide* the proportion of defects in
> the widgets that was occurring - this is the *exact* same problem that
> I have)
> 
>  
> > The first thing I would do is check whether there 
> > actually is *dependence*.  
> > You can't assume independence,
> > but you may be able to demonstrate it.
> 
> Unfortunately, I don't have a sample of the data to estimate the
> degree of dependence.  Plus, if I wanted to go out and get such a
> sample, I'd have to bug you guys for help on how to determine *that*
> sample size.  :)
> 
> > If the people's responses are not independent, that 
> > changes the *sort*  of statement that you should make.
> >  - Is there a correlation of 'p'  with N?  
> 
> Do you mean lower case 'p' (i.e. p-value) or the P in the formula
> above?
> 
> >  - Is this <something> bad, or neutral -- that is, do you *have*  to
> > place a limit on it, or should you seek a neutral statement 
> > of what exists, which gives the best feel for the distribution?
> > (I am asserting that the mean and CI  is apt to be a thin 
> > statement, if you are looking for understanding.)
> 
> Not sure if I understand this part completely, but I definitely am
> looking for a neutral, unbiased statement of my best guess of what
> *exists* (with a confidence level and interval of course)
> 
> > Hope this helps.
> 
> Thanks for your time,
> 
> Scott Edwards
> .
> .
> =================================================================
> Instructions for joining and leaving this list, remarks about the
> problem of INAPPROPRIATE MESSAGES, and archives are available at:
> .                  http://jse.stat.ncsu.edu/                    .
> =================================================================


