I have some two-stage population surveys that I need to analyze. If you'll bear with me, I'll work through the possible designs in order, from simple to complex. This is to get things clear for myself and to avoid wasting your time. I hope some less proficient readers will benefit, too. I'll assume ignorable nonresponse throughout.

The practical issue is whether it is better to use:

1. S-PLUS (Schafer's software) for multiple imputation, or
2. SUDAAN, which is specialized for analysing two-stage surveys (but does not deal with non-response in general).

"Ordinary" survey analysis software accounts for:

a. stratified sampling in the first or second stage
b. differential sampling probabilities within strata
c. clustering
d. a two-stage design (SUDAAN only)

As far as I know, multiple imputation gives you the same capabilities, and in addition it explicitly specifies a non-response model and accounts for non-response. I don't know about clustering, since I don't have it in my data.

In the first stage, a simple random sample is drawn from a finite population. A postal questionnaire is sent out and returned. Let X be covariates that are observed for everybody (e.g., age and gender) but are not design variables. Let Y be the questionnaire data that potentially have missing values (e.g. wheezing 0/1, asthma 0/1). Let R indicate response (0/1).

Case 1: full response from everybody in the sample
Relative to a census, the available data are MCAR, and a complete-data analysis is valid.

Case 2: some non-response, fully random
Relative to case 1 (full data), the data are MCAR, and a complete-data analysis is valid. However, you will lose power by excluding the subjects for whom only X is observed. A multiple imputation approach does not throw away the fact that you know X for these subjects, and this will give some increase in precision.

Case 3: some non-response, related to X and only X
That is, the data are missing at random within categories of X. Multiple imputation is appropriate, with a simple model that uses only X, R and Y to impute.

Now, in a second stage, a simple random sample is drawn from the *non-responders* in the first stage. Let I indicate whether or not a person is selected for the follow-up.
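
To make case 3 concrete for myself, here is a minimal simulation sketch in Python (with numpy). All the numbers in it are made up for illustration: a binary X, the wheezing prevalences, and the response probabilities are assumptions, not data from my surveys. Note also that it is not multiple imputation as such. It only computes the point estimate that an imputation model conditioning on X would aim for, by estimating the prevalence among responders within each category of X and averaging over the X distribution in the full sample.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # hypothetical first-stage sample size

# X: covariate observed for everybody (hypothetical: 1 = older, 0 = younger)
x = rng.integers(0, 2, size=n)

# Y: wheezing 0/1, more common when X = 1 (made-up prevalences)
y = rng.random(n) < np.where(x == 1, 0.30, 0.10)

# R: response 0/1, depending on X and only X (made-up response rates)
r = rng.random(n) < np.where(x == 1, 0.80, 0.40)

# Prevalence if everybody had responded (case 1)
true_prev = y.mean()

# Complete-case estimate: responders only, ignoring X
complete_case = y[r].mean()

# Estimate within categories of X among responders, then weight each
# category by its share of the full sample (X is known for everybody)
adjusted = sum((x == k).mean() * y[r & (x == k)].mean() for k in (0, 1))

print("full-response prevalence:", round(true_prev, 3))
print("complete-case estimate:  ", round(complete_case, 3))
print("adjusted within X:       ", round(adjusted, 3))
```

Under these assumptions the complete-case estimate should come out too high, because the category with more wheezing also responds more often, while the estimate adjusted within X should be close to the full-response prevalence.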

Case 4: some non-response in the follow-up stage, related to X and only X
Multiple imputation is appropriate, with a simple model that uses X, R, I and Y to impute.

Now, do I need to "weight up" the selected respondents? In "ordinary" survey-analysis software you would have to. But as I understand it, you impute the answers for the non-selected persons instead of "weighting up" only the selected people. So I think I don't need that.

Scenario 4 is what often happens in surveys, and so should be of some interest. Another scenario is one in which you do an inexpensive questionnaire on the entire sample, and do biological measurements on a subsample. As long as the follow-up sampling is simple random, case 4 applies. However, sometimes you want to do a stratified sampling that depends on the answers in the previous questionnaire:

Case 5: instead of a simple random sample drawn from the non-responders, draw a _stratified sample_ with differential sampling probabilities, depending on Y
E.g., select 50% of the wheezers and 10% of the non-wheezers. Let RR be the variable indicating response in the clinical follow-up, and let YY be the new data that you get (e.g. lung function).

This is the case that I need to crack now. Conceptually, you would imagine two stages:

a) Impute Y depending on X and R. Thus, you fill in the missing data in the first stage.
b) Impute YY depending on X, R, Y (imputed and real), I and RR. Thus, you fill in the missing data in the second stage.

However, if you do steps a) and b) only once, you lose the uncertainty that comes with imputation. You could do 5 imputations in step a), and within each of those do 5 imputations in step b). You'd end up with 25 data sets, and I'm not sure how to summarize them.

Does anybody have any good suggestions for case 5? Can you do it all in a single model? It is quite common in my field, and so should be of some general interest.

The bottom line is: do I need SUDAAN? I really don't think so.
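
For reference, this is the summary I do know how to compute: the usual combining rules (Rubin's rules) for m data sets imputed independently from a single model. The Python sketch below is only an illustration; the function name pool_rubin and the example numbers are made up. As far as I can see, the 25 nested data sets are not 25 independent imputations, since the 5 second-stage imputations within each first-stage imputation share the same imputed Y. So I would not assume that these rules carry over unchanged, and that is exactly my question.

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Combine m completed-data results with Rubin's rules.

    estimates: point estimate from each imputed data set
    variances: squared standard error from each imputed data set
    Assumes the m imputations are independent draws from one model.
    """
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = q.size
    q_bar = q.mean()            # pooled point estimate
    w = u.mean()                # within-imputation variance
    b = q.var(ddof=1)           # between-imputation variance
    t = w + (1 + 1 / m) * b     # total variance of the pooled estimate
    return q_bar, w, b, t

# Made-up example: a prevalence estimated in 5 imputed data sets
q_bar, w, b, t = pool_rubin(
    [0.20, 0.22, 0.18, 0.21, 0.19],
    [0.0004, 0.0004, 0.0004, 0.0004, 0.0004],
)
print("pooled estimate:", q_bar)
print("pooled standard error:", np.sqrt(t))
```

The between-imputation term b is what carries the imputation uncertainty that would be lost by doing steps a) and b) only once.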

Yours gratefully,

Jan Brogger
PhD student, Respiratory Epidemiology Group,
University of Bergen, Norway
