On 29 Oct 2003 at 19:49, Scott Edwards Tallahassee,FL wrote:
I would attack this kind of problem with simulation, which is much faster
than trying to derive a formula. Your problem could be phrased as a
multilevel problem, so you could find more help on the multilevel
list:
[EMAIL PROTECTED]
I would approach the simulation using a binomial mixed model, with the
individuals as groups, using R:
> library(nlme) # provides intervals()
> library(MASS) # provides glmmPQL()
> m <- 20 # records per person
> n <- 100 # number of persons
> a <- log(0.1/0.9) # logit model for p=0.1
> sdepsilon <- 0.2 # standard deviation on logit scale
> Y <- numeric(n*m) # binary observation vector
> id <- factor(rep(1:n, rep(m,n)) ) # group identification
> for (i in 1:n) { # simulation
+ epsilon <- rnorm(1, sd=sdepsilon)
+ p <- exp( a + epsilon)/(1+exp(a+epsilon))
+ for (j in 1:m) {
+ Y[(i-1)*m + j] <- rbinom(1,1,p)
+ }}
> mod1 <- glmmPQL(fixed = Y ~ 1, random = ~ 1 | id, family = binomial)
> intervals(mod1)
Approximate 95% confidence intervals

 Fixed effects:
                lower      est.     upper
(Intercept) -2.326024 -2.180668 -2.035312   # this is the interval for "a",
                                            # which you are interested in
attr(,"label")
[1] "Fixed effects:"

 Random Effects:
  Level: id
                       lower       est.    upper
sd((Intercept)) 1.030459e-05 0.03553273 122.5254

 Within-group standard error:
    lower      est.     upper
0.9693968 0.9998783 1.0313182
Note that the interval is on the logit scale; you must transform it to
get it on the probability scale.
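
The back-transformation can be sketched as follows. This is a minimal
example that copies the three numbers from the fixed-effects interval
above; the names logit_ci and prob_ci are only illustrative. The inverse
logit is monotone, so the transformed endpoints still bracket the
estimate. Here they come out at roughly 0.09 to 0.12, around the p = 0.1
used in the simulation.

```r
# Fixed-effects interval for "a" on the logit scale, copied from the
# intervals(mod1) output above (with the fitted model at hand,
# intervals(mod1)$fixed should give the same numbers).
logit_ci <- c(lower = -2.326024, est = -2.180668, upper = -2.035312)

# plogis() is the inverse logit: exp(x) / (1 + exp(x))
prob_ci <- plogis(logit_ci)

print(round(prob_ci, 3))
```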
Hope this helps,
Kjetil Halvorsen
> Rich,
> Thanks for your response. See comments/questions interspersed below.
> Scott
>
>
> Rich Ulrich <[EMAIL PROTECTED]> wrote in message news:<[EMAIL PROTECTED]>...
> > On 28 Oct 2003 17:55:10 -0800, [EMAIL PROTECTED] (Scott
> > Edwards Tallahassee,FL) wrote:
> >
> > > Does anyone know of a sampling formula to use for a proportion when
> > > you have 'clusters' of dependent cases in the population?
> > > I have to calculate a proportion of certain criteria that are met upon
> > > inspection of records in an agency. The problem is that a single
> > > employee may complete 10-20 records, therefore the assumption of
> > > independence of cases is blown. Therefore, I can't use the old
> > > tried-and-true sampling formula:
> > >
> > > n = (t^2 PQ / d^2) / (1 + (1/N) * (t^2 PQ / d^2 - 1))
> > >
> > > where
> > > n = sample size
> > > t = t-value for desired confidence level (e.g. 1.96 for 95%)
> > > P = proportion with trait in population
> > >     (if unknown, 50% is most conservative)
> > > Q = 1 - P
> > > d = desired precision (confidence interval = proportion +- d)
> > > N = population size
> > >
> > > which is used for, like, political polls where every case is an
> > > independent person.
> > > What do I do?
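
For reference, the formula quoted above is easy to evaluate in R. This is
only a sketch of the independent-cases formula, so it does not handle the
clustering that the question is about. The function name srs_sample_size
and the values P = 0.5, d = 0.05 and N = 2000 are made up for
illustration and do not come from the original question.

```r
# Sample size for estimating a proportion under independent sampling,
# with the finite population correction from the quoted formula.
srs_sample_size <- function(P, d, N, t = 1.96) {
  n0 <- t^2 * P * (1 - P) / d^2   # t^2 PQ / d^2, ignoring population size
  n0 / (1 + (n0 - 1) / N)         # apply the correction for population size N
}

# Hypothetical inputs: unknown P (use 0.5), +-5 points, 2000 records
n <- srs_sample_size(P = 0.5, d = 0.05, N = 2000)
print(ceiling(n))
```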
> >
> > You are asking for n, for the planning of a survey among N,
> > and your formula is using Finite population correction.
> > You can check with groups.google and see how often
> > I have told people that FPC is *usually* a bad idea.
>
> I was unaware of this. I will check your former messages.
> This appeared to be the 'standard' formula for sample size calculation
> when you are interested in a proportion of items that pass/fail, *and
> you have independence* (e.g. political polls), so I'm afraid that many
> of us are making this error. However, I am definitely not tied to this
> formula and am just looking for a method to get the job done as
> accurately as possible.
>
> > However, from what you say, I can see that you *might*
> > have an application that calls for it. On the other hand,
> > if you are trying to meet the requirement of state or
> > federal regulations, the procedures are probably
> > spelled out in detail. If you are trying to create methods
> > for a regulatory system, then you need more consultation
> > than you can get for free by e-mail.
>
> Actually, I was attempting to state why I *couldn't* use this
> formula, since it assumes independence, which I must be very suspect
> of.
> Regarding your comment on regulation, this is simply a data analysis
> problem - I'm not clear why the issue of regulation is relevant. If
> the methodology had been laid out in a regulation then I definitely
> wouldn't be wasting you guys' time asking for help in formulating one.
> The problem from a research design/analysis standpoint doesn't strike
> me as *that* unusual. I've read many times of how analyses must be
> adjusted due to 'clusters' of data points that are not independent
> (e.g. effect of temperature on performance of athletes measured
> multiple times); I just haven't seen how to approach it from a sample
> size determination perspective. Actually, it occurs to me that
> perhaps I should be conceptualizing the design as one of
> repeated-measures (each employee being a subject, with each record
> being evaluated being a point of measurement) - would that perhaps
> clarify how to determine the sample size? The problem of course is
> that I must end up with a point estimate and confidence interval for
> the entire _agency_, rather than looking at the difference between
> subjects, as in a typical repeated measures experiment. Still seems
> like it could be done.
>
> With regard to free vs. paid help, I certainly have no objection to
> people being paid for their time, and would have no objection to
> pursue that path if I had sufficient resources to pay someone
> $100/hour. However, I posted it here for two additional important
> reasons.
>
> 1. I wouldn't be sure who to approach to pay - I've tried all my
> stat friends and colleagues to no avail - and by posting it here I
> thought that I would approach the largest possible audience.
>
> 2. I was under the impression that this group was for the purpose
> of discussing interesting statistical issues/problems that had
> applicability beyond the specific problem. Perhaps I'm missing
> something, but I'm unable to see your perspective that this problem
> would only come up in the context of 'regulation' - to the contrary,
> it seems to me it would come up in many instances of evaluating
> organizations as a whole, with many individuals, performing multiple
> tasks (e.g. a factory with many employees, each making many widgets,
> where you want to estimate *factory-wide* the proportion of defects in
> the widgets; this is the *exact* same problem that
> I have).
>
>
> > The first thing I would do is check whether there
> > actually is *dependence*.
> > You can't assume independence,
> > but you may be able to demonstrate it.
>
> Unfortunately, I don't have a sample of the data to estimate the
> degree of dependence. Plus, if I wanted to go out and get such a
> sample, I'd have to bug you guys for help on how to determine *that*
> sample size. :)
>
> > If the people's responses are not independent, that
> > changes the *sort* of statement that you should make:
> > - Is there a correlation of 'p' with N?
>
> Do you mean lower case 'p' (i.e. p-value) or the P in the formula
> above?
>
> > - Is this <something> bad, or neutral -- that is, do you *have* to
> > place a limit on it, or should you seek a neutral statement
> > of what exists, which gives the best feel for the distribution?
> > (I am asserting that the mean and CI is apt to be a thin
> > statement, if you are looking for understanding.)
>
> Not sure if I understand this part completely, but I definitely am
> looking for a neutral, unbiased statement of my best guess of what
> *exists* (with a confidence level and interval of course)
>
> > Hope this helps.
>
> Thanks for your time,
>
> Scott Edwards
> .
> .
> =================================================================
> Instructions for joining and leaving this list, remarks about the
> problem of INAPPROPRIATE MESSAGES, and archives are available at:
> . http://jse.stat.ncsu.edu/ .
> =================================================================