Dear Listserv,

I would like to know in which situations the multiple imputation method works 
well when I have just two variables.

I did the following simulation study: I generated (X,Y) being 100 draws from 
the bivariate normal distribution with standard normal margins and a 
correlation coefficient of .7.
Next I created missing values in four different ways:

1. missing X's, depending on the value of Y (MAR)
2. missing Y's, depending on the value of X (MAR)
3. missing Y's, depending on the value of Y (MNAR)
4. missing X's, depending on the value of X (MNAR)

Here, I was motivated by the (in my view very nice) blood pressure example that 
Schafer and Graham (2002) use to illustrate the differences between MCAR, MAR 
and MNAR.
As far as I understood, the first two missing data mechanisms are MAR and the 
latter two are MNAR.
As Schafer and Graham did, I used a very strict method of creating missing 
values: I simply chopped off a part of the bivariate normal distribution.
In more detail: I created missing values if X (or Y) had a value below 0.5. 
This resulted in about 70% missing values, which is what could be expected from 
the standard normal distribution (Phi(0.5) is about .69).
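To make the setup concrete, the data generation and the four deletion 
mechanisms can be sketched as follows (a Python analogue of what I did in 
Stata; the function names and the seed are mine):

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility

def draw_data(n=100, rho=0.7):
    """Draw (X, Y) from a bivariate normal with standard normal
    margins and correlation rho."""
    cov = [[1.0, rho], [rho, 1.0]]
    x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    return x, y

def make_missing(x, y, mechanism, cut=0.5):
    """Apply one of the four deletion mechanisms:
    1: X missing where Y < cut (MAR)
    2: Y missing where X < cut (MAR)
    3: Y missing where Y < cut (MNAR)
    4: X missing where X < cut (MNAR)"""
    x, y = x.copy(), y.copy()
    if mechanism == 1:
        x[y < cut] = np.nan
    elif mechanism == 2:
        y[x < cut] = np.nan
    elif mechanism == 3:
        y[y < cut] = np.nan
    elif mechanism == 4:
        x[x < cut] = np.nan
    return x, y

x, y = draw_data()
x1, y1 = make_missing(x, y, 1)
print(np.isnan(x1).mean())  # roughly .69 = Phi(0.5)
```

With n = 100 this leaves only about 30 complete cases per replication.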

Note that the scatter diagrams of the COMPLETE CASES under mechanisms 1 and 3 
are identical and show the top slice of the bivariate normal distribution.
Likewise, the scatter diagrams under mechanisms 2 and 4 are identical and show 
the right-hand slice of the bivariate normal distribution.
The scatter diagrams suggest that regressing Y on X using complete case 
analysis will fail in cases 1 and 3: the top slice of the bivariate normal 
tilts the regression line and results in a biased estimate of the regression 
coefficient. They also suggest that complete case analysis may work well in 
cases 2 and 4, where missingness depends on X.
These suggestions were confirmed by the simulation study:
The mean regression coefficient (over 2000 simulations) came out at .29 in 
cases 1 and 3, i.e. when missingness depends on Y, showing a serious bias from 
the true value of .7.
Case 1 illustrates Allison's claim that "... if the data are not MCAR, but only 
MAR, listwise deletion can yield biased estimates" (Allison, 2001, p. 6).
When missingness depends on X, the mean regression coefficient came out at .70 
and is unbiased. Again this confirms one of Allison's claims: "... if the 
probability of missing data on any of the independent variables does not 
depend on the values of the dependent variable, then regression estimates using 
listwise deletion will be unbiased ..." (Allison, 2001, pp. 6-7).
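The complete case part of the simulation amounts to the following (again a 
Python sketch of my Stata runs, with my own function names):

```python
import numpy as np

rng = np.random.default_rng(2024)  # arbitrary seed

def ols_slope(x, y):
    """OLS slope of y regressed on x."""
    xc, yc = x - x.mean(), y - y.mean()
    return float(xc @ yc) / float(xc @ xc)

def cc_slope(mechanism, n=100, rho=0.7, cut=0.5):
    """One replication: draw (X, Y), delete by the given mechanism,
    and regress Y on X using the complete cases only."""
    cov = [[1.0, rho], [rho, 1.0]]
    x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    if mechanism in (1, 3):   # missingness depends on Y: cases with Y >= cut remain
        keep = y >= cut
    else:                     # mechanisms 2 and 4: cases with X >= cut remain
        keep = x >= cut
    return ols_slope(x[keep], y[keep])

reps = 2000
b1 = np.mean([cc_slope(1) for _ in range(reps)])  # biased, around .29
b2 = np.mean([cc_slope(2) for _ in range(reps)])  # around .70, unbiased
print(round(b1, 2), round(b2, 2))
```

The .29 for cases 1 and 3 also follows analytically from truncating Y at .5: 
the slope becomes rho*V / (rho^2*V + 1 - rho^2), where V is the variance of 
the truncated Y.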


Now comes the interesting part, where I used multiple imputation (detail: I 
used Stata's "ice" procedure and was able to replicate the results using 
Stata's "mi impute mvn").
I found the following results:

1. b = .59 (averaged over 2000 simulations)
2. b = .70
3. b = .29
4. b = .89

My point is: case 1 shows a bias!
Although substantially smaller than under complete case analysis (where 
b = .29), I still obtain a bias of .11.
Since case 1 is a case of MAR, I would have expected multiple imputation to 
provide an unbiased estimate.
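For reference, one imputation cycle for case 1 can be sketched as follows. 
This is a plain normal linear imputation model with proper parameter draws, a 
simplification in the spirit of, but not identical to, what ice does; all 
names are mine:

```python
import numpy as np

rng = np.random.default_rng(7)  # arbitrary seed

def impute_once(x, y):
    """One proper draw under a normal linear imputation model X | Y:
    draw sigma^2 and beta from their posteriors, then impute the
    missing X's with residual noise added."""
    obs = ~np.isnan(x)
    yo, xo = y[obs], x[obs]
    n_obs = int(obs.sum())
    Z = np.column_stack([np.ones(n_obs), yo])
    beta_hat, *_ = np.linalg.lstsq(Z, xo, rcond=None)
    resid = xo - Z @ beta_hat
    sigma2 = float(resid @ resid) / rng.chisquare(n_obs - 2)
    V = sigma2 * np.linalg.inv(Z.T @ Z)
    beta = rng.multivariate_normal(beta_hat, V)
    x_imp = x.copy()
    mis = ~obs
    x_imp[mis] = beta[0] + beta[1] * y[mis] \
        + rng.normal(0.0, np.sqrt(sigma2), int(mis.sum()))
    return x_imp

def ols_slope(x, y):
    xc, yc = x - x.mean(), y - y.mean()
    return float(xc @ yc) / float(xc @ xc)

def mi_slope(n=100, rho=0.7, cut=0.5, m=5):
    """Mechanism 1: X missing where Y < cut; impute m times and
    pool the slope estimates by averaging."""
    cov = [[1.0, rho], [rho, 1.0]]
    x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    x[y < cut] = np.nan
    return float(np.mean([ols_slope(impute_once(x, y), y) for _ in range(m)]))
```

Note that the imputation model is fit on only about 30 observed cases with a 
severely restricted range of Y, which may be relevant to the bias I observe.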

Do you have any clues why this happens?

I modified the simulation study by replacing the cut-off value in the missing 
data mechanism with a stochastic selection mechanism depending on X (or Y), 
but found similar results.

Kind regards,
Adriaan Hoogendoorn
GGZ inGeest
Amsterdam


Reference:
Schafer, J.L. & Graham, J.W. (2002), Missing Data: Our View of the State of the 
Art, Psychological Methods, 7(2), 147-177.
Allison, P.D. (2001), Missing Data, Sage Publications, Thousand Oaks, CA.
