Marc Schwartz wrote:
On Aug 3, 2009, at 12:06 AM, Thomas Mang wrote:
Hi,
Suppose a binomial GLM with both continuous and categorical
predictors (sometimes referred to as GLM-ANCOVA, if I remember
correctly). For the categorical predictors (indicator variables), is
there a suggested minimum frequency for each level? Would such a
rule or recommendation depend on the y-side too?
Example: N is quite large, a bit > 100. However, only 0/1s are observed
(so Bernoulli random variables, not binomial, because the covariates
come from observations and in general differ between
observations). There are two categorical predictors, each with 2
levels. Structurally it would probably also make sense to allow an
interaction between them, yielding de facto a single categorical
predictor with 4 levels. Is there then a minimum number of observations
that should fall in each of the 4 levels (either absolute or relative),
and should that minimum also take the y-side into account?
Must be the day for sample size questions for logistic regression. A
similar query is on MedStats today.
The typical minimum sample size recommendation for logistic regression
is based upon covariate degrees of freedom (or columns in the model
matrix). The guidance is that there should be 10 to 20 *events* per
covariate degree of freedom.
So if you have 2 factors, each with two levels, that gives you two
covariate degrees of freedom total (two columns in the model matrix). At
the high end of the above range, you would need 40 events in your sample.
If the event incidence in your sample is 10%, you would need 400 cases
to observe 40 events to support the model with the two two-level
covariates (Y ~ X1 + X2).
An interaction term (in addition to the 2 main effect terms, Y ~ X1 *
X2) in this case would add another column to the model matrix, thus, you
would need an additional 20 events, or another 200 cases in your sample.
So you could include the two two-level factors and the interaction term
if you have 60 events, or in my example, about 600 cases.
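The arithmetic above can be written out as a short sketch. This is plain Python for illustration only; the function name `required_sample` and its arguments are made up here, and the 20 events per df and 10% incidence are the figures from the example above, not fixed constants.

```python
import math
from fractions import Fraction

def required_sample(df, events_per_df, incidence):
    """Events and total cases implied by the events-per-df rule of thumb.

    df            -- covariate degrees of freedom (model-matrix columns)
    events_per_df -- 10 to 20 under the guidance quoted above
    incidence     -- expected proportion of events in the sample
    """
    events = df * events_per_df
    # Exact rational arithmetic avoids float round-off before rounding up.
    cases = math.ceil(Fraction(events) / Fraction(str(incidence)))
    return events, cases

# Y ~ X1 + X2: two columns in the model matrix
print(required_sample(2, 20, 0.10))
# Y ~ X1 * X2: the interaction adds a third column
print(required_sample(3, 20, 0.10))
```

At the low end of the range (10 events per df) the same call gives half the number of events and cases.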
Thanks for that. I suppose your term 'event' is not a technical GLM
term, so I assume that both the number of observed 0s _and_ the number
of observed 1s have to be >= 10 to 20 for each df (since it is arbitrary
which of them is the event and which is the non-event).
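Under that reading, the limiting quantity is the smaller of the two outcome counts. A minimal sketch, assuming a 0/1 response vector; the helper name `supported_df` and the example data are hypothetical and not taken from the thread:

```python
def supported_df(y, events_per_df=20):
    """Return (limiting count, covariate df supported by the rule).

    The limiting count is the smaller of the number of 1s and the
    number of 0s, since either outcome could be labelled the event.
    """
    ones = sum(1 for v in y if v == 1)
    zeros = sum(1 for v in y if v == 0)
    limiting = min(ones, zeros)
    return limiting, limiting // events_per_df

# Hypothetical data: 120 observations, 30 of them coded 1.
y = [1] * 30 + [0] * 90
print(supported_df(y, events_per_df=10))
print(supported_df(y, events_per_df=20))
```

With N only a bit over 100, the rule supports very few columns unless the 0s and 1s are fairly balanced.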
OK, two questions: The model also contains continuous predictors (call
them W), so the model is Y ~ X1*X2 + W. Does the same apply here too,
i.e. 10-20 more events for each df of these? [If the answer to the
former is yes, this question is redundant:] If there are interactions
between the continuous covariates and a categorical predictor
(Y ~ X1 * (X2 + W)), how many more events do I need? Does the rule for
the categorical predictors apply, or the one for the continuous
covariates?
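If the rule is read as counting model-matrix columns, as it was stated earlier in the thread, the columns for these formulas can be tallied term by term. The sketch below assumes treatment-coded factors with an intercept and all lower-order terms present, and that each continuous predictor enters linearly (one column). The helper `term_columns` is hypothetical, and the term lists are written out by hand rather than parsed from a formula.

```python
from math import prod

def term_columns(term, levels):
    """Columns contributed by one model term.

    term   -- tuple of variable names, e.g. ("X1", "W") for X1:W
    levels -- dict mapping each factor to its number of levels;
              variables not in the dict are treated as continuous
    """
    # A k-level factor contributes k - 1 columns, a linear continuous
    # predictor contributes 1; an interaction multiplies these.
    return prod(levels[v] - 1 if v in levels else 1 for v in term)

levels = {"X1": 2, "X2": 2}  # W is continuous, so it is not listed

# Y ~ X1*X2 + W expands to X1 + X2 + X1:X2 + W
model_a = [("X1",), ("X2",), ("X1", "X2"), ("W",)]
# Y ~ X1 * (X2 + W) expands to X1 + X2 + W + X1:X2 + X1:W
model_b = [("X1",), ("X2",), ("W",), ("X1", "X2"), ("X1", "W")]

for name, terms in [("Y ~ X1*X2 + W", model_a), ("Y ~ X1*(X2 + W)", model_b)]:
    print(name, sum(term_columns(t, levels) for t in terms))
```

Under these assumptions the interaction of W with a two-level factor adds one column, the same as the interaction of two two-level factors.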
many thanks !
Thomas
If you include the interaction term only in the absence of the main
effects (Y ~ X1:X2), that would yield 4 columns in the model matrix,
requiring 80 events, or about 800 cases. Without more details (e.g., your
underlying hypothesis), it is not clear to me that you gain anything
here compared with using the main effects and, potentially, the
interaction term together, and you certainly lose in terms of model
interpretation while requiring a notably larger sample size.
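To make the count of 4 concrete, the interaction-only coding amounts to one indicator column per combination of levels. A small sketch following the column count given above; the level labels are hypothetical, and the 20 events per column is the high end of the range quoted earlier:

```python
from itertools import product

x1_levels = ["a", "b"]  # hypothetical level labels
x2_levels = ["c", "d"]

# One cell, and hence one indicator column, per level combination.
cells = list(product(x1_levels, x2_levels))

def indicator_row(x1, x2):
    """0/1 row with a single 1 in the column for the cell (x1, x2)."""
    return [int((x1, x2) == cell) for cell in cells]

n_columns = len(cells)
events_needed = n_columns * 20
print(n_columns, events_needed)
print(indicator_row("b", "c"))
```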
As for a minimum sample size for each level of the factor-based
covariates, I am not aware of any specific guidance, short of dealing
with empty cells at the extreme. However, there are methods to assess
covariate complexity and to decide whether factor levels should be
collapsed. For more details on these issues, I would refer you to
Frank Harrell's book, Regression Modeling Strategies, specifically to
chapters 4 and 10-12. The former focuses on general multivariable
strategies and the latter focus on LR. More information here:
http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/RmS
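A simple first check for empty cells is to cross-tabulate the two factors against the outcome. The sketch below only counts observations per cell and lists the empty ones; it does not encode any minimum-frequency threshold, since none is given above. The function name `cell_table` and the data are made up for illustration.

```python
from collections import Counter
from itertools import product

def cell_table(x1, x2, y):
    """Count observations in every (x1, x2, y) cell, including empty ones."""
    counts = Counter(zip(x1, x2, y))
    cells = product(sorted(set(x1)), sorted(set(x2)), sorted(set(y)))
    # Counter returns 0 for combinations that were never observed.
    return {cell: counts[cell] for cell in cells}

# Hypothetical data: two two-level factors and a 0/1 outcome.
x1 = ["a", "a", "a", "b", "b", "b", "b", "a"]
x2 = ["c", "d", "c", "c", "d", "d", "c", "d"]
y = [0, 1, 0, 1, 0, 0, 1, 1]

table = cell_table(x1, x2, y)
empty = [cell for cell, n in table.items() if n == 0]
for cell, n in table.items():
    print(cell, n)
print("empty cells:", empty)
```

Note that the levels are taken from the observed values, so a level that never occurs in the data will not show up in the table at all.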
HTH,
Marc Schwartz