David Judkins wrote:

> Raquel,
>
> Your problem is typical of the class of problems that I have been
> working on for about 15 years now. You can look up my imputation
> papers in the CIS. None of the currently available (free or marketed)
> software solutions known to me are designed to preserve the structure
> of general multivariate data. The ones that build models of
> multivariate relationships are mostly designed for either normal or
> binary data. Programs designed for general data are usually designed
> to impute a single variable at a time and generally fail to preserve
> multivariate structure. If you have the luxury of a large programming
> budget, you could program the algorithms that some of us here at
> Westat have developed and published.

David,

In theory you are correct, but I think your note slightly misses the
point. It is amazing how well the chained equations approach of MICE
and my aregImpute function work, given that they were not designed to
preserve multivariate structure. And they make fewer assumptions. I am
particularly dubious about any methods that assume linearity and
multivariate normality. aregImpute uses Fisher's optimum scoring
algorithm to impute nominal variables. If predictive mean matching is
used with aregImpute (a more nonparametric approach not available with
your multivariate approach), the distribution of imputed categories is
quite sensible.

Frank Harrell
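A minimal sketch of what such a call might look like, using aregImpute
from the Hmisc package with predictive mean matching. The data frame d
and the item names x1-x5 are hypothetical stand-ins, not code from the
thread:

    ## Sketch only: 'd' and the item names are invented for
    ## illustration. Declaring the 1-5 codes as factors makes
    ## aregImpute treat them as nominal (via optimal scoring) rather
    ## than as continuous.
    library(Hmisc)
    d[] <- lapply(d, factor)
    set.seed(1)
    imp <- aregImpute(~ x1 + x2 + x3 + x4 + x5, data = d,
                      n.impute = 5, type = "pmm")
    ## With predictive mean matching, imputed values are drawn from
    ## observed donors, so every imputed category is a legal code and
    ## no rounding step is needed.
    imp$imputed$x1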
> As Alan replied, however, given that all your individual item rates
> are low, perhaps one of the available solutions would work reasonably
> well for you.
>
> It sounds as if you don't have any skip patterns. If so, you could
> just impute the mode for each variable. A second solution that is
> only a little more complicated would be to independently impute each
> variable by a simple hotdeck. Either way, you end up with 100%
> complete vectors. You don't have to do any rounding. All variables
> have permissible values. You will have better marginal distributions
> with independent hotdecks than you get by imputing modes.
>
> But neither solution protects multivariate structure. Here is a bit
> more complicated solution that tries to do that but is still fairly
> simple:
>
> Pick a single variable as the most important for your analyses. Call
> it Y. Let S be the maximum set of variables with zero item
> nonresponse. Build the best model for Y in terms of S that you can.
> (It doesn't have to be a linear model.) Output predicted values of Y
> for the whole sample; call them Ypred. Let O be the maximum set of
> cases with zero nonresponse on all variables. For each case with one
> or more missing values, find its nearest neighbor in O in terms of
> Ypred. You then have a donor case and a recipient case. Let
> X1i,...,Xpi be the set of variables on recipient case i with missing
> values, and let X1j,...,Xpj be the corresponding set of variables on
> the donor case. Impute Xki = Xkj for k = 1,...,p.
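For concreteness, here is one way the nearest-neighbor recipe above
might look in R. Everything here is a hypothetical sketch: dat, y, s1,
and s2 are invented names, and lm stands in for whatever model of Y on
S works best:

    ## 1. Model Y on the fully observed predictors (the set S) and
    ##    compute Ypred for the whole sample; S has no missing values,
    ##    so a prediction exists for every case.
    fit   <- lm(y ~ s1 + s2, data = dat)
    ypred <- predict(fit, newdata = dat)

    ## 2. O = cases complete on all variables; recipients have >= 1 NA.
    donors <- which(complete.cases(dat))
    recips <- which(!complete.cases(dat))

    ## 3. For each recipient, take the donor nearest in Ypred and copy
    ##    the donor's values into the recipient's missing fields.
    for (i in recips) {
      j    <- donors[which.min(abs(ypred[donors] - ypred[i]))]
      miss <- which(is.na(dat[i, ]))
      dat[i, miss] <- dat[j, miss]
    }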
> To the extent that the variables in S are good predictors of Y, and
> to the extent that the other variables are related to Y, you should
> get slightly better preservation of covariances than with independent
> hotdecks. There are many variants on this theme. You will still have
> some fading of multivariate structure, however, and you will
> underestimate post-imputation variances.
>
> For combining hotdecks with multiple imputation, see the exciting new
> papers by Siddique and Belin and by Little, Yosef, Cain, Nan, and
> Harlow, both in the first issue of volume 27 of Statistics in
> Medicine.
>
> --Dave
>
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Alan Zaslavsky
> Sent: Wednesday, January 02, 2008 10:07 AM
> To: [email protected]; [email protected]
> Subject: [Impute] Rounding option on PROC MI and choosing a final MI dataset
>
>> From: "Raquel Hampton" <[email protected]>
>> Subject: [Impute] Rounding option on PROC MI and choosing a final MI dataset
>>
>> My first question is: there is a round option for PROC MI, but I
>> read in an article (Horton, N.J., Lipsitz, S.P., & Parzen, M.
>> (2003). A potential for bias when rounding in multiple imputation.
>> The American Statistician, 57(4), 229-232) that using the round
>> option for categorical data (the items have nominal responses,
>> ranging from 1 to 5) produces biased estimates, even though the
>> rounded values are logical. So what can be done? I only have access
>> to SAS and STATA, but I am not very familiar with STATA. Will this
>> be less of a problem since the proportion of missing for each
>> individual item is small?
>
> Do you really mean nominal (unordered categories, like French,
> German, English, or chocolate, vanilla, strawberry) or ordinal (like
> poor, fair, good, excellent)? If nominal, you won't get anything
> sensible by fitting a normal model and rounding. If ordinal and well
> distributed across the categories, the bias from using rounded data
> will be less than with the binomial data primarily considered by the
> Horton et al. article.
>
> You might also consider whether it is necessary to round at all --
> that depends on how the data will be used in further analyses.
>
> With only a couple of percent missing on each item, all of the issues
> about imputation become less crucial, although, as noted in a
> previous response, you should definitely run the proper MI analysis
> to verify that the between-imputation contribution to variance is
> small. In practice any modeling exercise is a compromise that puts
> more effort into the most important aspects of the model, and in this
> case that might not require doing the most methodologically advanced
> things with the imputation.
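Alan's two points, that nominal items should be imputed as categories
rather than rounded and that the between-imputation contribution to
variance should be verified as small, can both be illustrated with the
mice package in R. A hedged sketch; d, the items q1-q3, and the
analysis model are all invented for illustration:

    ## Sketch only: 'd' holds the survey items as factors. The
    ## 'polyreg' method imputes unordered categories directly, so no
    ## rounding is needed.
    library(mice)
    imp <- mice(d, m = 5, method = "polyreg", seed = 1)
    ## Pool an example analysis. In the pooled table, 'b' is the
    ## between-imputation variance, 't' the total variance, and 'fmi'
    ## the fraction of missing information -- all should be small with
    ## only a couple of percent missing per item.
    fit <- with(imp, glm(q1 == "5" ~ q2 + q3, family = binomial))
    pool(fit)$pooled[, c("term", "estimate", "b", "t", "fmi")]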
> _______________________________________________
> Impute mailing list
> [email protected]
> http://lists.utsouthwestern.edu/mailman/listinfo/impute

--
Frank E Harrell Jr
Professor and Chair, Department of Biostatistics
School of Medicine, Vanderbilt University