I think you reversed patterns A and B in your text. A is missing year 3 and B is missing year 1.
My AutoImpute software has an option that allows the user to exclude any variable from the predictive set for another, so yes, such software exists, but it is licensed only for Westat projects. I don't think that IVEware offers users that much control over model formation, but it might practically do something very similar by estimating a coefficient for Year 1 in the Year 3 model near zero. It would pretty much have to since the correlation between I1 and O3 on B cases is likely to be very small (I1 standing for preliminary imputation of year 1). I think this sort of argument would apply to any chained approach to imputation.

--Dave

From: Impute -- Imputations in Data Analysis [mailto:[email protected]] On Behalf Of Paul von Hippel
Sent: Tuesday, September 20, 2011 9:21 AM
To: [email protected]
Subject: Re: Imputing panel data, constraining correlations at long lags

Thanks, Dave. You've come up with a nicely simplified version of my problem.

Suppose I had only three waves of data, with every subject missing either wave 1 (your pattern A) or wave 3 (your pattern B). Ordinarily I would put the data in wide format --

A  O1 O2 M3
B  M1 O2 O3

-- and impute using a multivariate normal model. However, I don't think that would work in this case because the MVN model would want to estimate the correlation between wave 1 and wave 3, and there are no cases where both wave 1 and wave 3 are observed.

However, if I could tell the software that this was, say, an AR(1) process -- or, equivalently, that partial correlation between waves 1 and 3 is zero -- I'd be in business. This could be done using MVN software that allowed me to impose constraints on the covariance matrix, or imputation software for serially correlated data. Does such software exist?
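To make the constraint concrete, here is a minimal sketch of what "partial correlation between waves 1 and 3 is zero" implies for the three-wave covariance matrix: the unobservable covariance is pinned down by the two adjacent-wave covariances. The function name `complete_covariance` and the numbers are illustrative assumptions, not taken from any imputation package or from the actual test-score data.

```python
import numpy as np

def complete_covariance(var1, var2, var3, cov12, cov23):
    """Fill in cov(wave 1, wave 3), assuming zero partial correlation given wave 2."""
    # Under that assumption: cov13 = cov12 * cov23 / var2
    cov13 = cov12 * cov23 / var2
    return np.array([
        [var1, cov12, cov13],
        [cov12, var2, cov23],
        [cov13, cov23, var3],
    ])

# Illustrative values only: unit variances, adjacent-wave correlations 0.85 and 0.80.
sigma = complete_covariance(1.0, 1.0, 1.0, 0.85, 0.80)
precision = np.linalg.inv(sigma)

print(sigma[0, 2])      # implied wave 1 / wave 3 covariance
print(precision[0, 2])  # (1, 3) entry of the inverse covariance matrix
```

The (1, 3) entry of the inverse covariance matrix comes out numerically zero, which is the same statement as the zero partial correlation. Software that accepted a covariance matrix completed this way would not need any cases with both waves 1 and 3 observed.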
Best,
Paul

________________________________
From: David Judkins <[email protected]>
To: [email protected]
Sent: Tuesday, September 20, 2011 7:25 AM
Subject: Re: Imputing panel data, constraining correlations at long lags

Paul,

This sounds pretty challenging. Reminds me of Andrew Gelman's JSM talk and 1998 JASA paper on imputation of questions not asked. It also reminds me of a remark some speaker made this year at JSM about almost all natural processes being Markov chains. Not sure I buy that, but I think he meant that if you have a rich enough state vector, then one past observation is all you need. Of course, that would be trivially true if the state vector contained lagged latent values. In this case, I doubt your state vector is rich enough to compensate for the brevity of the student-level time series, but I guess you have to work with what you have.

Whatever you do I imagine will involve a lot of custom programming. However, you might be able to use Raghu's IVEware on a series of specially reshaped versions of your data. For example, to impute year 3 for subject A and year 1 for subject B, you might create a dataset with only A and B records in it shaped like this:

A  O1 O2 M3
B  M1 O2 O3

Once that was done, you could proceed to imputing Year 4 on A and B records and Year 2 on C records with a dataset shaped from A, B, and C records as

A  O2 I3 M4
B  O2 O3 M4
C  M2 O3 O4

And so on. At the end of that, you would have 4 observed/imputed years per subject. There should then be a way to generalize to more than 4 per subject. Not very elegant, but it might work.

--Dave

________________________________
From: Impute -- Imputations in Data Analysis [[email protected]] on behalf of Paul von Hippel [[email protected]]
Sent: Monday, September 19, 2011 5:58 PM
To: [email protected]
Subject: Imputing panel data, constraining correlations at long lags

I have panel data where different students are tested for overlapping 2-year periods.

* Subject A is observed for years 1 & 2.
* Subject B is observed for years 2 & 3.
* Subject C is observed for years 3 & 4.
* etc., up to year 12 (of school)

For each observed year there are three separate test occasions (fall, winter, spring) and two subjects (reading, math).

It seems to me I can impute the missing test scores provided I am willing to assume something about lags that are 2 years or longer. For example, I could assume that the partial correlation at lags of 2 years or longer is zero. This is not an unreasonable assumption since the correlations at shorter lags are very strong (.8-.9).

Is there software that will allow me to do this conveniently? My usual strategy is to reshape the data from long to wide and then impute using a multivariate normal model. There are several packages that will permit this; however, I am not aware of software that will let me constrain the covariance matrix in the way I have described.

I have not used imputation software that is tailored for panel data -- such as Schafer et al.'s PAN package, recently ported from S-Plus to R. I could try that, provided there is a convenient way to restrict the long lags.

Thanks!

--
Best wishes,
Paul von Hippel
Assistant Professor
LBJ School of Public Affairs
Sid Richardson Hall 3.251
University of Texas, Austin
2315 Red River, Box Y
Austin, TX 78712
mobile, preferred (614) 282-8963
office (512) 232-3650
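For readers who want to experiment with the chained idea discussed in this thread, the following is a rough sketch under strong assumptions: one score per year, four years, three overlap patterns (A, B, C), simulated AR(1) data, and plain stochastic regression on the adjacent year. It is not IVEware or AutoImpute, the helper `impute_step` is hypothetical, and it produces a single imputation without drawing the regression parameters, so it understates uncertainty.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in data (an assumption): AR(1) scores over 4 years, lag-1 correlation rho.
n, rho = 300, 0.85
full = np.empty((3 * n, 4))
full[:, 0] = rng.normal(size=3 * n)
for t in range(1, 4):
    full[:, t] = rho * full[:, t - 1] + np.sqrt(1 - rho**2) * rng.normal(size=3 * n)

# Pattern 0 (A) observes years 1-2, pattern 1 (B) years 2-3, pattern 2 (C) years 3-4.
pattern = np.repeat([0, 1, 2], n)
observed = np.zeros_like(full, dtype=bool)
for p in range(3):
    observed[pattern == p, p:p + 2] = True
data = np.where(observed, full, np.nan)

def impute_step(data, observed, target, predictor):
    """Fill missing `target` values from the adjacent `predictor` column only.

    The regression is fitted on rows where both columns were actually observed;
    a normal residual is added to each prediction. Modifies `data` in place.
    """
    fit = observed[:, target] & observed[:, predictor]
    todo = np.isnan(data[:, target]) & ~np.isnan(data[:, predictor])
    x, y = data[fit, predictor], data[fit, target]
    slope, intercept = np.polyfit(x, y, 1)
    resid_sd = np.std(y - (intercept + slope * x), ddof=2)
    noise = rng.normal(0.0, resid_sd, size=todo.sum())
    data[todo, target] = intercept + slope * data[todo, predictor] + noise

# Forward pass fills later years; backward pass fills earlier years.
for t in range(1, 4):
    impute_step(data, observed, target=t, predictor=t - 1)
for t in range(2, -1, -1):
    impute_step(data, observed, target=t, predictor=t + 1)

lag2_corr = np.corrcoef(data[:, 0], data[:, 2])[0, 1]
print(lag2_corr)  # lag-2 correlation in the completed data
```

Because each year is imputed from its neighbor alone, the completed data carry a lag-2 correlation close to the product of the two lag-1 correlations, which is exactly the zero-partial-correlation assumption. If that assumption is wrong, the long-lag correlations in the imputed data will be too weak.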
