After literally months of work, I've finally managed to use multiple
imputation. I'm so close now, so I'd be really grateful for any help. I
want to include an interaction between a continuous and a categorical
variable in the imputation model.
I have a rather large cross-sectional population study performed at two
time points in the same population (n=40,000). Completely observed
variables are: age (continuous), gender (M/F), time (1/2). Variables with
missing values include various symptoms (e.g. asthma) and risk factors
(0/1), 6 in all. Unit response rate is about 80%, and unit nonresponders
are included in the data. Item nonresponse varies from 1% to 30%.
Hardware and software are Schafer's MIX, S-PLUS 2000, Win98, 1 GHz,
384 MB RAM. I'm using restricted models: initial estimates with ECM, data
augmentation with DABIPF.
The main point of the analysis is to estimate the risk of asthma (or other
symptom/diagnosis) by age, separately for the two time points. This is done
using logistic regression, with asthma as the outcome, main effects of age
(cubic spline) and time (dummy 0/1), and the interaction between age
(splined) and time (dummy).
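To make the analysis model concrete, here is a minimal sketch in
R-flavoured S. The data are simulated purely for illustration, and the
choice of a natural cubic spline with 3 degrees of freedom (ns(age, df = 3))
is an assumption, since the post only says "cubic spline". The variable
names are hypothetical.

```r
library(splines)   # where ns() lives in R

set.seed(1)                      # simulated data, illustration only
n <- 2000
age  <- runif(n, 15, 70)
time <- rbinom(n, 1, 0.5)        # dummy 0/1 for the two time points
asthma <- rbinom(n, 1, plogis(-2.5 + 0.01 * age + 0.3 * time))
d <- data.frame(asthma, age, time)

# Main effects of splined age and time, plus their interaction.
fit <- glm(asthma ~ ns(age, df = 3) * time, family = binomial, data = d)

# Estimated risk of asthma by age, separately for each time point.
grid <- expand.grid(age = seq(20, 70, by = 10), time = 0:1)
grid$risk <- predict(fit, newdata = grid, type = "response")
```

The interaction terms are what let the age curve differ between the two
time points, which is why the imputation model needs to preserve them.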
I'm having some trouble producing an imputation model that includes this
interaction. Since age is a continuous variable, an "obvious" possible
model is:
-categorical/loglinear W: gender, time
-continuous Z: age, asthma and the other symptoms
-design matrix: gender, time, gender*time (margins: don't matter)
Am I right that this model does not properly include the time by age
interaction?
So what I've done is use age as a categorical variable instead:
-categorical/loglinear W: gender, time, age as a categorical variable
(dummy coded)
-continuous Z: asthma and the other symptoms and risk factors (6 all in
all)
-margins: "main effects" only, and design matrix: time, gender, age
(dummy), time*age(dummy)
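For concreteness, the design matrix described in this second model could
be built roughly as follows. This is a sketch under stated assumptions, not
the code actually used: it assumes 3 age groups, an intercept column,
categorical columns in the order gender, time, age, and cells ordered with
the first variable varying fastest (which is what expand.grid produces).
The margins vector follows my reading of the MIX documentation, where
margins are given by variable number and separated by zeros.

```r
# Hypothetical sizes: 2 genders, 2 time points, 3 age groups.
n.age <- 3
cells <- expand.grid(gender = 1:2, time = 1:2, age = 1:n.age)

# Dummy columns (reference levels: gender 1, time 1, age group 1).
g  <- as.numeric(cells$gender == 2)
tm <- as.numeric(cells$time == 2)
a  <- sapply(2:n.age, function(k) as.numeric(cells$age == k))

# One row per cell: intercept, time, gender, age dummies, time*age dummies.
design <- cbind(1, tm, g, a, tm * a)

# "Main effects only" loglinear model for (gender, time, age).
margins <- c(1, 0, 2, 0, 3)

D  <- nrow(design)       # number of cells
r  <- ncol(design)       # number of regression columns
rk <- qr(design)$rank    # should equal r if the columns are not redundant
```

With 3 age groups this gives 12 cells and 7 columns. With each age as its
own category (56), the same construction gives 2 * 2 * 56 = 224 cells and
3 + 55 + 55 = 113 columns.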
A model that does not include the time*age interactions runs fine. When I
use the model with time*age interactions and try to find ECM estimates
(ecm.mix), the first step produces mu and sigma components that are all NA.
The data are all coded correctly (1/2) and I've always run rngseed() and
prelim.mix() first. I've tried different numbers of age groups (3, 4 or 5
age groups, and each age as its own category (56)), different priors for
ECM, and different random number seeds (which shouldn't matter). I've also
tried excluding unit nonresponders, taking the logarithm of the
"continuous" variables, and imputing only for asthma, all to no avail.
When I use only unit responders (imputing only item nonresponse) I get the
following warning, then an application halt:
Warning: .Fortran("mstepcm",: sqrt(-7.27596e-0.12): DOMAIN error.
Do you have any suggestions? Have I dug myself in too deep - am I
misinterpreting the general location model? I'm grateful for any help.
Jan Brogger, MD
University of Bergen, Norway