On Wed, Jul 6, 2011 at 7:14 PM, Christopher Jordan-Squire <cjord...@uw.edu> wrote:
> On Wed, Jul 6, 2011 at 3:47 PM, <josef.p...@gmail.com> wrote:
>> On Wed, Jul 6, 2011 at 4:38 PM, <josef.p...@gmail.com> wrote:
>> > On Wed, Jul 6, 2011 at 4:22 PM, Christopher Jordan-Squire <snip>
>> >> Mean value replacement, or more generally single scalar value
>> >> replacement, is generally not a good idea. It biases downward your
>> >> standard error estimates if you use mean replacement, and it will
>> >> bias both if you use anything other than mean replacement. The bias
>> >> gets worse with more missing data. So it's worst in precisely the
>> >> cases where you'd want to fill in the data the most. (Though I admit
>> >> I'm not too familiar with time series, so maybe this doesn't apply.
>> >> But it's true as a general principle in statistics.) I'm not sure
>> >> why we'd want to make this use case easier.
>>
>> Another qualification on this (I cannot help it).
>> I think this only applies if you use a prefabricated no-missing-values
>> algorithm. If I write it myself, I can do the proper correction for
>> the reduced number of observations. (Similar to the case when we
>> ignore correlated information and use statistics based on uncorrelated
>> observations, which also overestimate the amount of information we have
>> available.)
>>
>
> Can you do that sort of technique with longitudinal (panel) data? I'm
> honestly curious because I haven't looked into such corrections before. I
> haven't been able to find a reference after a few quick Google searches. I
> don't suppose you know one off the top of your head?
>
> And you're right about the last measurement carried forward. I was just
> thinking about filling in all missing values with the same value.
>
> -Chris Jordan-Squire
>
> PS--Thanks for mentioning the statsmodels discussion. I'd been keeping track
> of that on a different email account, and I hadn't realized it wasn't
> forwarding those messages correctly.
>
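
For anyone who wants to see the two quoted points side by side, here is a minimal sketch, not anything from the thread itself. It assumes made-up data (1000 draws from N(0, 1)) with 40% of the values missing completely at random; all variable names are hypothetical. It compares the standard error of the mean computed naively on a mean-filled array with the one computed from the observed values only, and then applies the "reduced number of observations" correction to the filled array.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption: synthetic data, values missing completely at random.
n = 1000
x = rng.normal(loc=0.0, scale=1.0, size=n)
missing = rng.random(n) < 0.4

x_obs = x[~missing]
n_obs = x_obs.size

# Mean replacement: every missing entry gets the observed mean.
x_filled = x.copy()
x_filled[missing] = x_obs.mean()

# Naive standard error: treats all n entries as real observations.
se_naive = x_filled.std(ddof=1) / np.sqrt(n)

# Standard error using only the values that were actually observed.
se_obs = x_obs.std(ddof=1) / np.sqrt(n_obs)

# Hand-written correction on the filled array: the filled entries add
# nothing to the sum of squared deviations, so divide by the observed
# count instead of the full length.
ss = ((x_filled - x_filled.mean()) ** 2).sum()
se_corrected = np.sqrt(ss / (n_obs - 1)) / np.sqrt(n_obs)

print(f"naive: {se_naive:.4f}  observed-only: {se_obs:.4f}  "
      f"corrected: {se_corrected:.4f}")
```

Under these assumptions the naive figure comes out smaller than the observed-only one by a factor of roughly n_obs / n, while the corrected figure matches the observed-only one. This only covers the simple mean of a single variable; whether a comparable correction carries over to panel data is the open question above.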
Maybe a bit OT, but I've seen people doing imputation using Bayesian MCMC or
multiple imputation for missing values in panel data. Google 'data
augmentation' or 'multiple imputation'. I haven't looked much into the
details yet, but it's definitely not mean replacement.

FWIW (I haven't been following the discussion closely), there is a
distinction in statistics between ignorable and nonignorable missing data,
but I can't think of a situation where I would need this at the
computational level rather than relying on (numerically comparable)
missing data type(s) a la SAS/Stata. I've also found the odd examples of
IGNORE without a clear answer to be scary.

Skipper
_______________________________________________
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion