On 1 Dec 2001, jenny wrote:

> What should I do with the missing values in my data. I need to
> perform a t test of two samples to test the mean difference between
> them.
> How should I handle them in S-Plus or SAS?

1. What do S-Plus and/or SAS do with missing values by default? (All packages have defaults, and sometimes they're even sensible ones. If your package(s) do what you want done, or at least do something you can live with, that's probably the most comfortable resolution of your question.)

2. Why are there missing values? And what do these reasons imply (if anything) about the values themselves?

There are essentially two choices available:

(a) Treat the values as missing: that is, discard each case for which the variable in question is missing for the duration of the analysis of that variable, and retrieve those cases again when dealing with some other variable for which their value is not missing. This is the default in MINITAB and SPSS, although for some analyses (in both packages) the missing cases are deleted listwise (in multiple regression, for example, if any of the variables in the model is missing, the whole case is deleted from the analysis) and for some the missing cases are deleted pairwise (in reporting a correlation matrix, for example, a case is deleted from the computation of a correlation coefficient if either of the two variables is missing, but is retained for other correlation coefficients for which both variables are non-missing in this case).

(b) Impute some value to the missing variable for this case. There is a great variety of imputation schemes, all of them (so far as I know) suffering from the logical defect that one must assume something about the missing value, and the assumption may not only be untrue, it may be wildly in error.

One approach is to substitute the mean of this variable for the missing value; but if the _reason_ the value is missing implies that the actual value is likely to be extremely high or extremely low, this is evidently not a good strategy.

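To make the listwise/pairwise distinction under (a) concrete, here is a minimal sketch in plain Python (not S-Plus or SAS). The five cases, the variable names x, y, z, and the helper `complete_cases` are all made up for illustration; the sketch only counts how many cases each deletion rule would retain.

```python
# Illustrative data only: None marks a missing value.
cases = [
    {"x": 1.0, "y": 2.0, "z": 5.0},
    {"x": 2.0, "y": None, "z": 4.0},
    {"x": 3.0, "y": 6.0, "z": None},
    {"x": 4.0, "y": 8.0, "z": 1.0},
    {"x": None, "y": 10.0, "z": 2.0},
]


def complete_cases(rows, variables):
    """Keep only the rows in which every listed variable is present.

    Hypothetical helper, not a function from any statistics package.
    """
    return [r for r in rows if all(r[v] is not None for v in variables)]


# One variable at a time: a case is dropped only while analysing
# a variable for which it is missing.
per_variable_n = {v: len(complete_cases(cases, [v])) for v in ("x", "y", "z")}

# Listwise deletion: a case is dropped if ANY variable in the model is missing.
listwise = complete_cases(cases, ["x", "y", "z"])

# Pairwise deletion: for each pair (as in a correlation matrix), a case is
# dropped only if one of the two variables in that pair is missing.
pairs = [("x", "y"), ("x", "z"), ("y", "z")]
pairwise_n = {pair: len(complete_cases(cases, pair)) for pair in pairs}

print("cases available per variable:", per_variable_n)
print("cases kept under listwise deletion:", len(listwise))
print("cases kept per pair under pairwise deletion:", pairwise_n)
```

With these made-up cases, each single variable has 4 usable cases, each pair has 3, and listwise deletion over all three variables leaves only 2. That shrinkage is the practical cost of listwise deletion when several variables each have a few missing values.
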
Another approach is to use some variant of multiple regression to predict the missing value from the existing values of other variables; again, this assumes that the missing value would be close to the regression line (or surface), and if the _reason_ implies an extreme value or outlier, this is not particularly likely to yield a realistic value.

This is of course a simplified account (some might say oversimplified) of the problem of missing-ness, but may suggest some useful ideas.

Personally, most of the time I prefer to acknowledge that I don't know the value that's missing, and let the case be temporarily discarded, at least for a first run at an analysis (or series of analyses). And if I chose to use a method of imputation, I'd usually want to report results both of analyses in which the missing data are honestly missing, and analyses in which imputed values are used, so that I (and my readers) could see the effect(s) of the imputation.

And since you want to test for differences between means, you almost certainly should NOT substitute a _mean_ for any missing value. If you substitute the overall mean, you will tend to diminish the real difference, if any, between the two sample means, and if there's a lot of missing data you could end up not finding differences where they would have been evident if you'd permitted the missing cases to be discarded. If you substitute the mean of this subgroup, you will not change the apparent difference between the means, but you WILL reduce the within-group (pooled or not) variance, and you will also overstate the sample sizes, so that you will have spuriously high sensitivity to differences between the means.

Whether there is an argument that would support any other method of imputation in your case, I cannot tell. I'm inclined to doubt it, but that may be merely a reflection of my usual skepticism (or, perhaps, curmudgeonliness).

 ------------------------------------------------------------------------
 Donald F. Burrill                                     [EMAIL PROTECTED]
 184 Nashua Road, Bedford, NH 03110                         603-471-7128
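The warning above about substituting a mean before a two-sample t test can be checked with a small numerical sketch. It is written in plain Python rather than S-Plus or SAS, the two samples are invented for illustration, and the helper names (`observed`, `fill`, `welch_t`, `summarize`) are hypothetical. It computes only the unpooled t statistic, not a p-value.

```python
from math import sqrt
from statistics import mean, variance

# Made-up samples for illustration only; None marks a missing value.
group_a = [10.0, 12.0, 14.0, None, None]
group_b = [16.0, 18.0, 20.0, None, None]


def observed(values):
    """Discard the missing values."""
    return [v for v in values if v is not None]


def fill(values, substitute):
    """Replace every missing value with the given substitute."""
    return [substitute if v is None else v for v in values]


def welch_t(a, b):
    """Unpooled two-sample t statistic for mean(b) - mean(a)."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(b) - mean(a)) / standard_error


def summarize(a, b):
    return {"diff": mean(b) - mean(a), "var_a": variance(a), "t": welch_t(a, b)}


# Strategy 1: let the missing cases be discarded.
discarded = summarize(observed(group_a), observed(group_b))

# Strategy 2: substitute the overall mean of all observed values.
overall_mean = mean(observed(group_a) + observed(group_b))
overall_filled = summarize(fill(group_a, overall_mean), fill(group_b, overall_mean))

# Strategy 3: substitute each group's own observed mean.
group_filled = summarize(
    fill(group_a, mean(observed(group_a))),
    fill(group_b, mean(observed(group_b))),
)

for label, result in [
    ("discard missing", discarded),
    ("overall mean", overall_filled),
    ("group mean", group_filled),
]:
    print(label, result)
```

In this toy data, substituting the overall mean shrinks the difference between the sample means from 6 to 3.6, while substituting each group's own mean leaves the difference at 6 but halves the within-group variance (from 4 to 2) and raises the t statistic, which is the spurious sensitivity described above.
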