Dear all: I worked on imputing income in the dim and distant past, and 
have a few comments on this email stream:

1. I did some work with Martin David and others (David et al 1986 JASA 
29-41) on modeling log income, and we found very extreme imputations of 
income when untying the log transform. This was caused by violation of 
the constant variance assumption, which led to too much noise being 
added to the large predicted mean log income values, leading to 
problems when exponentiating. This led us to stratify on predicted log 
income when adding residuals to the predictions, a method very close to 
predictive mean matching. Modeling income as the outcome avoids the 
problem with the transformation, but it is not very appealing for the 
obvious reasons. We found that we got reasonable R-squareds in our 
regressions, but we had a lot of covariates and worked quite hard on 
specification. An alternative approach to wages and salary income is to 
model the hourly wage rate by linear regression, weighting cases by the 
number of hours worked.
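Point 1 can be made concrete with a small simulation. This is only a sketch of the failure mode and of the stratified-residual fix, not the code from David et al. (1986); the constants and the form of the heteroscedasticity are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=n)
mu = 9.0 + 0.8 * x                          # predicted mean log income
# Non-constant variance: suppose log income is much less noisy for cases
# with large predicted means, violating the homoscedastic model
sd = np.where(mu > np.median(mu), 0.1, 0.6)
log_inc = mu + rng.normal(size=n) * sd
resid = log_inc - mu
true_inc = np.exp(log_inc)

# Naive draw: constant-variance noise on the log scale, then exponentiate.
# The pooled residual SD is far too large for the big predicted values.
naive = np.exp(mu + rng.normal(size=n) * resid.std())

# Fix: stratify on predicted log income and resample observed residuals
# within stratum (close in spirit to predictive mean matching).
strata = np.digitize(mu, np.quantile(mu, [0.2, 0.4, 0.6, 0.8]))
strat = np.empty(n)
for s in range(5):
    idx = strata == s
    strat[idx] = np.exp(mu[idx] + rng.choice(resid[idx], size=idx.sum()))

# Compare the noise injected into the top decile of predicted means
top = mu > np.quantile(mu, 0.9)
print(f"SD of log imputations, top decile -- truth: "
      f"{np.log(true_inc[top]).std():.2f}, naive: "
      f"{np.log(naive[top]).std():.2f}, stratified: "
      f"{np.log(strat[top]).std():.2f}")
```

On the dollar scale the excess noise at the top is exponentiated, which is exactly the mechanism behind the extreme imputations described above.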

2. It is quite feasible to do multiple imputation and predictive mean 
matching, provided more than one match is found for each missing
value. This was proposed in Little (1988 JBES 287-301), which also
considered bootstrapping the set of matches to make the method proper
in Rubin's sense. (BTW, I believe this was the paper that coined the
term "predictive mean matching"). Of course there is a penalty to be 
paid in terms of quality of the matches, but to my mind this is more 
than compensated by gains in efficiency in the MI estimate, and the 
propagation of imputation error. I know of people currently interested 
in refinements of these methods, but not sure if anything has been 
published.
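A minimal sketch of the idea in point 2 (my own code, not Little 1988's): predictive mean matching with k donors per missing case, refitting the regression on a bootstrap sample of the observed cases in each imputation so that parameter uncertainty, and hence imputation error, is propagated across the m datasets:

```python
import numpy as np

def pmm_impute(y, X, missing, m=5, k=5, rng=None):
    """Multiple imputation of y by predictive mean matching.

    For each of m imputations: regress y on X over a bootstrap sample of
    the observed cases, predict for everyone, and for each missing case
    draw one donor at random from the k observed cases whose predicted
    means are closest. A sketch, not a production implementation."""
    rng = np.random.default_rng(rng)
    obs = ~missing
    obs_idx = np.flatnonzero(obs)
    imputations = []
    for _ in range(m):
        # Bootstrapping makes the fitted model (and donor ranking) vary
        # across imputations -- the "proper" ingredient.
        b = rng.choice(obs_idx, size=obs_idx.size, replace=True)
        beta, *_ = np.linalg.lstsq(X[b], y[b], rcond=None)
        pred = X @ beta
        y_imp = y.copy()
        for i in np.flatnonzero(missing):
            donors = obs_idx[np.argsort(np.abs(pred[obs] - pred[i]))[:k]]
            y_imp[i] = y[rng.choice(donors)]
        imputations.append(y_imp)
    return imputations

# Illustration on simulated income-like data with 30% missing
rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = 20_000 + 8_000 * x + rng.normal(scale=3_000, size=n)
missing = rng.random(n) < 0.3
y[missing] = np.nan

imps = pmm_impute(y, X, missing, m=5, k=5, rng=2)
print("mean imputed income per dataset:",
      [f"{imp[missing].mean():,.0f}" for imp in imps])
```

Because every imputed value is copied from an observed case, the imputations stay on the support of the data, which is part of the appeal for a heaped, skewed variable like income.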

3. You might also look at work by Schenker, Taylor and Lazzeroni (some 
permutation of these authors) back in the 90's comparing imputation by 
matching methods and parametric regression models. In general I like 
matching methods with a lot of data, and more parametric methods with 
smaller data sets.

4. Nat Schenker, Raghunathan and colleagues have been working on 
income imputation for NCHS surveys, in a way that deals with the 
issues of household clustering. I'm not sure if the refereed version of 
this work is published yet but it should be available in recent JSM 
proceedings or from Nat himself.

Best, Rod Little

Quoting Frank E Harrell Jr <[email protected]>:

> David Judkins wrote:
>> Frank,
>>
>> It depends on the fineness of grain in the predictions generated by 
>> the model, but in the extreme case where there is a single nearest 
>> match for each missing case, then drawing that nearest case five 
>> times will result in five identical imputations, leading, of course, 
>> to zero between-imputation variance for any marginal statistic of 
>> the variable.  I am not certain who was first to make this point, 
>> but you can find it among other places on page 500 of J.N.K. Rao's 
>> article in the June 1996 JASA trio of competing paradigms by Rubin, 
>> Fay, and Rao.  Rao references Särndal 1992.
>>
>> A workable approach and references to other workable approaches for 
>> predictive mean matching (aka nearest neighbor imputation) are given 
>> in Kim 2002:
>>
>> Kim, Jae Kwang (2002). Variance estimation for nearest neighbor 
>> imputation with application to census long form data. ASA Proceedings 
>> of the Joint Statistical Meetings, 1857-1862. American Statistical 
>> Association (Alexandria, VA). Keywords: Fractional imputation; 
>> Jackknife; Section on Survey Research Methods; JSM  --David Judkins
>
> Great point David - thanks.  For that reason, my R/S-Plus aregImpute 
> function does weighted sampling using Tukey's tricube function with a 
> sharp peak at the closest match in predicted value.  I'm getting much 
> better distributions of imputed values when I do that.
>
> Frank
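A sketch of the weighted-donor idea Frank describes, with invented names and not the actual aregImpute implementation: tricube weights on the distance between predicted means favor the closest match sharply while still leaving nonzero between-imputation variability:

```python
import numpy as np

def tricube_draw(pred_obs, y_obs, pred_miss, span, rng):
    """Draw one donor value for a missing case using tricube weights
    w = (1 - (d/span)^3)^3 on the distance d between predicted means,
    so the nearest matches dominate without being deterministic."""
    d = np.abs(pred_obs - pred_miss)
    w = np.clip(1.0 - (d / span) ** 3, 0.0, None) ** 3
    if w.sum() == 0.0:                 # no donor inside the span
        return y_obs[np.argmin(d)]
    return y_obs[rng.choice(y_obs.size, p=w / w.sum())]

# Illustration: donors on a grid, target predicted mean 0.5, span 0.2
rng = np.random.default_rng(3)
pred_obs = np.linspace(0.0, 1.0, 101)
y_obs = 100.0 * pred_obs
draws = np.array([tricube_draw(pred_obs, y_obs, 0.5, 0.2, rng)
                  for _ in range(1000)])
print(f"{np.unique(draws).size} distinct donors used; "
      f"sd of draws = {draws.std():.1f}")
```

Repeated draws use several distinct donors near the best match, so five imputations of the same case need not be identical, unlike deterministic nearest-neighbor matching.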
>
>>
>> -----Original Message-----
>> From: Frank E Harrell Jr [mailto:[email protected]] Sent: 
>> Wednesday, March 29, 2006 3:42 PM
>> To: David Judkins
>> Cc: Paul T. Shattuck; [email protected]
>> Subject: Re: [Impute] range of imputed values for income
>>
>> David Judkins wrote:
>>
>>> I am not aware of the capabilities of IVEware, but the general question
>>> of person-level mean squared prediction error is a function of both the
>>> covariates and the imputation procedure.  As Dr. Rubin has pointed out,
>>> minimizing person-level MSPE is not typically a primary goal in the
>>> analysis of surveys and experiments, although it might be important in
>>> an activity like fraud detection.  Nonetheless, reduced person-level MSPE
>>> should also translate into both lower variances on estimated population
>>> and superpopulation marginal parameters and reduced bias on regression
>>> coefficients.  So you want to use as rich a set of covariates in the
>>> imputation as are available to you and to use the model-based
>>> predictions in your imputation to at least some extent.  Unfortunately,
>>> the stronger the usage you make, the more difficult it becomes to
>>> estimate the post-imputation variance.  For example, a predictive-mean
>>> matching approach to imputation defeats multiple imputation as a
>>> variance-estimation technique.  For normally distributed outcomes,
>>
>>
>> David - It's not clear to me why PMM would invalidate using the 
>> Rubin variance estimator for regression coefficient variances.  But 
>> maybe you are saying that PMM doesn't work if you are primarily 
>> interested in estimating a variance parameter (what kind?).  -Frank 
>> Harrell
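For reference, the Rubin variance estimator in question combines within- and between-imputation variance; a generic sketch for a scalar parameter (not tied to any package mentioned in this thread):

```python
import numpy as np

def rubin_pool(estimates, variances):
    """Rubin's rules for m completed-data analyses of a scalar parameter:
    pooled estimate qbar, total variance T = W + (1 + 1/m) * B, where
    W is the average within-imputation variance and B the
    between-imputation variance of the m estimates."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = q.size
    qbar = q.mean()
    W = u.mean()
    B = q.var(ddof=1)
    T = W + (1.0 + 1.0 / m) * B
    return qbar, T

# Five hypothetical coefficient estimates with their complete-data variances
qbar, T = rubin_pool([1.1, 0.9, 1.0, 1.2, 0.8],
                     [0.04, 0.05, 0.04, 0.06, 0.05])
print(f"pooled estimate {qbar:.3f}, total variance {T:.3f}")
```

If the m imputations were identical, as in the single-nearest-match extreme discussed earlier, B would be 0 and T would collapse to the average complete-data variance, understating the uncertainty due to missingness.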
>>
>>
>>> really good methods that both utilize covariate information and allow
>>> post-imputation variance estimation are pretty much Bayesian and involve
>>> Gibbs sampling to fit complex models and make reasonable posterior
>>> draws.  (See Schafer's book.) Even they do not cope well with the
>>> natural heaping in income where people round to the nearest thousand
>>> dollars or even worse. I have some papers on how to impute non-normal
>>> outcomes using covariates that are subject to missing values themselves,
>>> but I have not yet been able to develop and validate good
>>> post-imputation variance estimators to go with them.  Your 
>>> person-level MSPE seems so large that I suspect your software is
>>> not using any covariates.  While that makes post-imputation variance
>>> estimation easy, it seems like you could do better.  The 
>>> preservation of the marginal first and second order moments of
>>> income seems to support the idea that you are not using any covariates.
>>> The robustness of the model coefficients is harder to reconcile.  I
>>> think this can only happen with a simple imputation procedure if the
>>> missing data rate is negligible or if the model isn't very good to begin
>>> with.  If substantial numbers of subjects were being thrown back and
>>> forth between $3,000 and $100,000 per year, the coefficients in good
>>> models would certainly be attenuated. Maybe you just don't have any
>>> variables that are strongly related to income?
>>>
>>> David Judkins
>>> Senior Statistician, Westat
>>> 1650 Research Boulevard, Rockville, MD 20850
>>> (301) 315-5970
>>> [email protected]
>>>
>>> -----Original Message-----
>>> From: [email protected]
>>> [mailto:[email protected]] On Behalf Of Paul T.
>>> Shattuck
>>> Sent: Wednesday, March 29, 2006 11:43 AM
>>> To: [email protected]
>>> Subject: [Impute] range of imputed values for income
>>>
>>> Hello,
>>>
>>> I am using IVEware for multiple imputation for the first time on a large
>>> national health survey.  One of the variables imputed is income and 
>>> I'm finding that imputed values can vary dramatically 
>>> within-subjects across
>>> multiply imputed datasets.  For instance, in some cases Person A 
>>> might have an imputed income of $3,000 in one imputation, and then 
>>> $100,000 in another imputation.  This within-person variability 
>>> far exceeds what I'm seeing with other variables in the survey.  
>>> The distributions, means, and standard deviations of the imputed 
>>> vs. non-imputed values are
>>> comparable.  And multivariate regression results using the multiply 
>>> imputed datasets and the original dataset with missing values are 
>>> reasonably robust, with the same substantive conclusions and very 
>>> close coefficient estimates.  So, I'm wondering if this degree of 
>>> within-subject variability across imputations is something to worry 
>>> about, and potentially an indicator of a mis-specified imputation 
>>> model....or whether this kind of within-subject variability across 
>>> imputed datasets is typical.
>>>
>>> Thanks,
>>>
>>> Paul Shattuck
>>>
>>
>>
>>
>
>
> -- 
> Frank E Harrell Jr   Professor and Chair           School of Medicine
>                      Department of Biostatistics   Vanderbilt University
>
> _______________________________________________
> Impute mailing list
> [email protected]
> http://lists.utsouthwestern.edu/mailman/listinfo/impute
>
>
>


