On Aug 5, 2009, at 12:51 AM, Thomas Mang wrote:
Marc Schwartz wrote:
On Aug 3, 2009, at 12:06 AM, Thomas Mang wrote:
Hi,
Suppose a binomial GLM with both continuous and categorical
predictors (sometimes referred to as GLM-ANCOVA, if I remember
correctly). For the categorical predictors (indicator variables),
is there then a suggested minimum frequency for each level? Would
such a rule/recommendation depend on the y-side too?
Example: N is quite large, a bit > 100. Observed, however, are only
0/1s (so Bernoulli random variables, not Binomial, because the
covariates come from observations and are in general always
different between observations). There are two categorical
predictors, each with 2 levels. It would structurally probably also
make sense to allow an interaction between them, yielding de facto
a single categorical predictor with 4 levels. Is there then a
minimum number of observations that must fall in each of the 4
level categories (either absolute or relative), or does that also
depend on the y-side?
Must be the day for sample size questions for logistic regression.
A similar query is on MedStats today.
The typical minimum sample size recommendation for logistic
regression is based upon covariate degrees of freedom (or columns
in the model matrix). The guidance is that there should be 10 to 20
*events* per covariate degree of freedom.
So if you have 2 factors, each with two levels, that gives you two
covariate degrees of freedom total (two columns in the model
matrix). At the high end of the above range, you would need 40
events in your sample.
If the event incidence in your sample is 10%, you would need 400
cases to observe 40 events to support the model with the two two-
level covariates (Y ~ X1 + X2).
An interaction term (in addition to the 2 main effect terms, Y ~ X1
* X2) in this case would add another column to the model matrix,
thus, you would need an additional 20 events, or another 200 cases
in your sample.
So you could include the two two-level factors and the interaction
term if you have 60 events, or in my example, about 600 cases.
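The arithmetic above can be sketched directly in R; the 20 events per
degree of freedom and the 10% incidence are just the illustrative
numbers from this example, not fixed constants:

```r
## Events-per-covariate-df arithmetic from the example above
events_per_df <- 20    # upper end of the 10-20 rule of thumb
incidence    <- 0.10   # assumed event rate in the sample

## Y ~ X1 + X2: two 2-level factors -> 2 covariate df
df_main     <- 2
events_main <- events_per_df * df_main    # 40 events needed
cases_main  <- events_main / incidence    # 400 cases needed

## Y ~ X1 * X2: the interaction adds one more column -> 3 df
df_int     <- 3
events_int <- events_per_df * df_int      # 60 events needed
cases_int  <- events_int / incidence      # 600 cases needed
```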
Thanks for that. I suppose your term 'event' does not refer to
anything technical in GLMs, so I assume that the number of observed
0s _or_ 1s has to be >= 10-20 for each df (since it is arbitrary
which of them is the event and which is the non-event).
Sorry for any confusion. In my applications (clinical), we are
typically modeling/predicting the probability of a discrete event
(e.g. death, stroke, repeat intervention) or, more generally, the
presence/absence of some characteristic (e.g. renal failure). So I
think in terms of events, which more generally also corresponds to
Cox regression, where similar 'event'/sample size guidelines apply
when looking at time-based event models.
As you note, the count/sample size requirements importantly refer to
the smaller incidence/proportion of the two possible response variable
values. So you may be interested in modeling/predicting a response
value that has a probability of 0.7, but the requirements will be
based upon the 0.3 probability response value.
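A small numeric sketch of that point (the 3 df and 20 events per df
below are illustrative figures, not part of the original example):

```r
## Sample size is driven by the rarer of the two response values
p <- 0.7                     # probability of the value you want to model
p_limiting <- min(p, 1 - p)  # 0.3 is what limits you

df_model      <- 3                        # illustrative covariate df
events_needed <- 20 * df_model            # 60 'events' of the rarer value
cases_needed  <- events_needed / p_limiting  # about 200 cases
```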
OK, two questions: The model also contains continuous predictors
(call them W, so the model is Y ~ X1*X2 + W). Does the same apply
here too -> for each df of these, 10-20 more events? [If the answer
to the former is yes, this question is now redundant:] If there are
interactions between the continuous covariates and a categorical
predictor (Y ~ X1 * (X2 + W)), how many more events do I need? Does
the rule for the categorical predictors apply, or that for the
continuous covariates?
I tend to think in terms of the number of columns that would be in the
model matrix, where each column corresponds to one covariate degree of
freedom. So if you create a model matrix using contrived data that
reflects your expected actual data, along with a given formula, you
can perhaps better quantify the requirements. See ?model.matrix for
more information.
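For instance, a quick sketch with contrived data (the names X1, X2
and W follow the thread; the data values are arbitrary):

```r
## Contrived data mirroring the variables in the thread
d <- data.frame(
  X1 = factor(rep(c("a", "b"), each = 2)),
  X2 = factor(rep(c("c", "d"), times = 2)),
  W  = c(0.5, 1.2, 0.7, 2.3)
)

## Model matrix for the two factors plus their interaction
mm <- model.matrix(~ X1 * X2, data = d)
colnames(mm)   # "(Intercept)" "X1b" "X2d" "X1b:X2d"
ncol(mm) - 1   # 3 covariate df (intercept excluded)
```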
Each continuous variable, as a main effect term, creates a single
column in the model matrix and therefore adds one degree of freedom,
requiring 10-20 'events' for each and the corresponding increase in
the number of total cases.
A single interaction term between a factor and a continuous variable
(Factor * Continuous) results in 'nlevels(factor) - 1' additional
columns in the model matrix. So again, for each additional column, the
'event'/sample size requirements are in place.
Of course, more complex interaction terms and formulae will impact the
model matrix accordingly, so as noted, it may be best to create one
using dummy data, if your model formulae will be more complicated.
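Following that advice, the fuller formula from the question can be
checked directly; again the data values are made up, and the 10-20
figure is the rule of thumb discussed above:

```r
## Contrived data, as before
d <- data.frame(
  X1 = factor(rep(c("a", "b"), each = 2)),
  X2 = factor(rep(c("c", "d"), times = 2)),
  W  = c(0.5, 1.2, 0.7, 2.3)
)

## Y ~ X1 * (X2 + W) expands to X1 + X2 + W + X1:X2 + X1:W
mm <- model.matrix(~ X1 * (X2 + W), data = d)
colnames(mm)   # "(Intercept)" "X1b" "X2d" "W" "X1b:X2d" "X1b:W"
ncol(mm) - 1   # 5 covariate df -> roughly 50-100 events by the rule of thumb
```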
HTH,
Marc Schwartz
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.