Re: [R] svy / weighted regression
Dear Thomas David That makes sense! If I wanted to use survey on the summarized data, I suppose that I could 'de-summarize' or 're-individualize' the data to give the design object the correct information on the number of observations. Or I could revert to using the actual individual-level data. Thanks a lot, your input has been very helpful. Laust Post doc. Laust Mortensen, PhD Epidemiology Unit University of Southern Denmark 2009/10/13 Thomas Lumley tlum...@u.washington.edu: I think there is a much simpler explanation. The survey design object has eight observations, two per country. With a sample size of two per country it is hardly surprising that country-specific estimates are not very precise. The actual data has hundreds of thousands of observations per country, so it will have more precise estimates. Grouping the data doesn't make a difference for model-based glm estimation, where it is simply a computational convenience. It *does* make a difference for design-based estimation, because it changes the design. -thomas On Tue, 13 Oct 2009, Laust wrote: Dear David, Thanks again for your input! I realize that I did a bad job of explaining this in my first email, but the setup is that in Finland persons who die are sampled with a different probability (1) from those who live (.5). This was done by the Finnish data protection authorities to protect individuals against identification. In the rest of the countries everyone is sampled with a probability of 1. The data that I am supplying to R is summarized data for each country stratified by case status. Another way of organizing the data would be: # creating data listc - c(Denmark,Finland,Norway,Sweden) listw - c(1,2,1,1) listd - c(1000,1000,1000,2000) listt - c(755000,505000,905000,191) list.cwdt - c(listc, listw, listd, listt) country2 - data.frame(country=listc,weight=listw,deaths=listd,time=listt) I hope that it is clearer now that for no value of the independent variable 'country' is the rate going to be zero. I think this was also not the case in my original example, but this was obscured by my poor communication- R-skills. But if data is organized this way then sampling weight of 2 for Finland should only be applied to the time-variable that contains person years at risk and *not* to the number of deaths, which would complicate matters further. I would know how to get this to work in R or in any other statistical package. Perhaps it is - as Peter Dalgaard suggested - the estimation of the dispersion parameter by the survey package that is causing trouble, not the data example eo ipso. Or perhaps I am just using survey in a wrong way. Best Laust Post doc. Laust Mortensen, PhD Epidemiology Unit University of Southern Denmark On Mon, Oct 12, 2009 at 3:32 PM, David Winsemius dwinsem...@comcast.net wrote: I think you are missing the point. You have 4 zero death counts associated with much higher person years of exposure followed by 4 death counts in the thousands associated with lower degrees of exposures. It seems unlikely that these are real data as there are not cohorts that would exhibit such lower death-rates. So it appears that in setting up your test case, you have created an impossibly unrealistic test problem. -- David On Oct 12, 2009, at 9:12 AM, Laust wrote: Dear Peter, Thanks for the input. The zero rates in some strata occurs because sampling depended on case status: In Finland only 50% of the non-cases were sampled, while all others were sampled with 100% probability. Best Laust On Sat, Oct 10, 2009 at 11:02 AM, Peter Dalgaard p.dalga...@biostat.ku.dk wrote: Sorry, forgot to reply all... Laust wrote: Dear list, I am trying to set up a propensity-weighted regression using the survey package. Most of my population is sampled with a sampling probability of one (that is, I have the full population). However, for a subset of the data I have only a 50% sample of the full population. In previous work on the data, I analyzed these data using SAS and STATA. In those packages I used a propensity weight of 1/[sampling probability] in various generalized linear regression-procedures, but I am having trouble setting this up. I bet the solution is simple, but I’m a R newbie. Code to illustrate my problem below. Hi Laust, You probably need the package author to explain fully, but as far as I can see, the crux is that a dispersion parameter is being used, based on Pearson residuals, even in the Poisson case (i.e. you effectively get the same result as with quasipoisson()). I don't know what the rationale is for this, but it is clear that with your data, an estimated dispersion parameter is going to be large. E.g. the data has both 0 cases in 75 person-years and 1000 cases in 5000 person-years for Denmark, and in your model they are supposed to have the same Poisson rate. summary.svyglm starts off with
Re: [R] svy / weighted regression
Dear David, Thanks again for your input! I realize that I did a bad job of explaining this in my first email, but the setup is that in Finland persons who die are sampled with a different probability (1) from those who live (.5). This was done by the Finnish data protection authorities to protect individuals against identification. In the rest of the countries everyone is sampled with a probability of 1. The data that I am supplying to R is summarized data for each country stratified by case status. Another way of organizing the data would be: # creating data listc - c(Denmark,Finland,Norway,Sweden) listw - c(1,2,1,1) listd - c(1000,1000,1000,2000) listt - c(755000,505000,905000,191) list.cwdt - c(listc, listw, listd, listt) country2 - data.frame(country=listc,weight=listw,deaths=listd,time=listt) I hope that it is clearer now that for no value of the independent variable 'country' is the rate going to be zero. I think this was also not the case in my original example, but this was obscured by my poor communication- R-skills. But if data is organized this way then sampling weight of 2 for Finland should only be applied to the time-variable that contains person years at risk and *not* to the number of deaths, which would complicate matters further. I would know how to get this to work in R or in any other statistical package. Perhaps it is - as Peter Dalgaard suggested - the estimation of the dispersion parameter by the survey package that is causing trouble, not the data example eo ipso. Or perhaps I am just using survey in a wrong way. Best Laust Post doc. Laust Mortensen, PhD Epidemiology Unit University of Southern Denmark On Mon, Oct 12, 2009 at 3:32 PM, David Winsemius dwinsem...@comcast.net wrote: I think you are missing the point. You have 4 zero death counts associated with much higher person years of exposure followed by 4 death counts in the thousands associated with lower degrees of exposures. It seems unlikely that these are real data as there are not cohorts that would exhibit such lower death-rates. So it appears that in setting up your test case, you have created an impossibly unrealistic test problem. -- David On Oct 12, 2009, at 9:12 AM, Laust wrote: Dear Peter, Thanks for the input. The zero rates in some strata occurs because sampling depended on case status: In Finland only 50% of the non-cases were sampled, while all others were sampled with 100% probability. Best Laust On Sat, Oct 10, 2009 at 11:02 AM, Peter Dalgaard p.dalga...@biostat.ku.dk wrote: Sorry, forgot to reply all... Laust wrote: Dear list, I am trying to set up a propensity-weighted regression using the survey package. Most of my population is sampled with a sampling probability of one (that is, I have the full population). However, for a subset of the data I have only a 50% sample of the full population. In previous work on the data, I analyzed these data using SAS and STATA. In those packages I used a propensity weight of 1/[sampling probability] in various generalized linear regression-procedures, but I am having trouble setting this up. I bet the solution is simple, but I’m a R newbie. Code to illustrate my problem below. Hi Laust, You probably need the package author to explain fully, but as far as I can see, the crux is that a dispersion parameter is being used, based on Pearson residuals, even in the Poisson case (i.e. you effectively get the same result as with quasipoisson()). I don't know what the rationale is for this, but it is clear that with your data, an estimated dispersion parameter is going to be large. E.g. the data has both 0 cases in 75 person-years and 1000 cases in 5000 person-years for Denmark, and in your model they are supposed to have the same Poisson rate. summary.svyglm starts off with est.disp - TRUE and AFAICS there is no way it can get set to FALSE. Knowing Thomas, there is probably a perfectly good reason not to just set the dispersion to 1, but I don't get it either... Thanks Laust # loading survey library(survey) # creating data listc - c(Denmark,Finland,Norway,Sweden,Denmark,Finland,Norway,Sweden) listw - c(1,2,1,1,1,1,1,1) listd - c(0,0,0,0,1000,1000,1000,2000) listt - c(75,50,90,190,5000,5000,5000,1) list.cwdt - c(listc, listw, listd, listt) country - data.frame(country=listc,weight=listw,deaths=listd,yrs_at_risk=listt) # running a frequency weighted regression to get the correct point estimates for comparison glm - glm(deaths ~ country + offset(log(yrs_at_risk)), weights=weight, data=country, family=poisson()) summary(glm) regTermTest(glm, ~ country) # running survey weighted regression svy - svydesign(~0,,data=country, weight=~weight) svyglm - svyglm(deaths ~ country + offset(log(yrs_at_risk)), design=svy, data=country, family=poisson()) summary(svyglm) # point estimates are correct, but standard error is way too large regTermTest(svyglm, ~
Re: [R] svy / weighted regression
I think there is a much simpler explanation. The survey design object has eight observations, two per country. With a sample size of two per country it is hardly surprising that country-specific estimates are not very precise. The actual data has hundreds of thousands of observations per country, so it will have more precise estimates. Grouping the data doesn't make a difference for model-based glm estimation, where it is simply a computational convenience. It *does* make a difference for design-based estimation, because it changes the design. -thomas On Tue, 13 Oct 2009, Laust wrote: Dear David, Thanks again for your input! I realize that I did a bad job of explaining this in my first email, but the setup is that in Finland persons who die are sampled with a different probability (1) from those who live (.5). This was done by the Finnish data protection authorities to protect individuals against identification. In the rest of the countries everyone is sampled with a probability of 1. The data that I am supplying to R is summarized data for each country stratified by case status. Another way of organizing the data would be: # creating data listc - c(Denmark,Finland,Norway,Sweden) listw - c(1,2,1,1) listd - c(1000,1000,1000,2000) listt - c(755000,505000,905000,191) list.cwdt - c(listc, listw, listd, listt) country2 - data.frame(country=listc,weight=listw,deaths=listd,time=listt) I hope that it is clearer now that for no value of the independent variable 'country' is the rate going to be zero. I think this was also not the case in my original example, but this was obscured by my poor communication- R-skills. But if data is organized this way then sampling weight of 2 for Finland should only be applied to the time-variable that contains person years at risk and *not* to the number of deaths, which would complicate matters further. I would know how to get this to work in R or in any other statistical package. Perhaps it is - as Peter Dalgaard suggested - the estimation of the dispersion parameter by the survey package that is causing trouble, not the data example eo ipso. Or perhaps I am just using survey in a wrong way. Best Laust Post doc. Laust Mortensen, PhD Epidemiology Unit University of Southern Denmark On Mon, Oct 12, 2009 at 3:32 PM, David Winsemius dwinsem...@comcast.net wrote: I think you are missing the point. You have 4 zero death counts associated with much higher person years of exposure followed by 4 death counts in the thousands associated with lower degrees of exposures. It seems unlikely that these are real data as there are not cohorts that would exhibit such lower death-rates. So it appears that in setting up your test case, you have created an impossibly unrealistic test problem. -- David On Oct 12, 2009, at 9:12 AM, Laust wrote: Dear Peter, Thanks for the input. The zero rates in some strata occurs because sampling depended on case status: In Finland only 50% of the non-cases were sampled, while all others were sampled with 100% probability. Best Laust On Sat, Oct 10, 2009 at 11:02 AM, Peter Dalgaard p.dalga...@biostat.ku.dk wrote: Sorry, forgot to reply all... Laust wrote: Dear list, I am trying to set up a propensity-weighted regression using the survey package. Most of my population is sampled with a sampling probability of one (that is, I have the full population). However, for a subset of the data I have only a 50% sample of the full population. In previous work on the data, I analyzed these data using SAS and STATA. In those packages I used a propensity weight of 1/[sampling probability] in various generalized linear regression-procedures, but I am having trouble setting this up. I bet the solution is simple, but I’m a R newbie. Code to illustrate my problem below. Hi Laust, You probably need the package author to explain fully, but as far as I can see, the crux is that a dispersion parameter is being used, based on Pearson residuals, even in the Poisson case (i.e. you effectively get the same result as with quasipoisson()). I don't know what the rationale is for this, but it is clear that with your data, an estimated dispersion parameter is going to be large. E.g. the data has both 0 cases in 75 person-years and 1000 cases in 5000 person-years for Denmark, and in your model they are supposed to have the same Poisson rate. summary.svyglm starts off with est.disp - TRUE and AFAICS there is no way it can get set to FALSE. Knowing Thomas, there is probably a perfectly good reason not to just set the dispersion to 1, but I don't get it either... Thanks Laust # loading survey library(survey) # creating data listc - c(Denmark,Finland,Norway,Sweden,Denmark,Finland,Norway,Sweden) listw - c(1,2,1,1,1,1,1,1) listd - c(0,0,0,0,1000,1000,1000,2000) listt - c(75,50,90,190,5000,5000,5000,1) list.cwdt - c(listc, listw, listd, listt) country -
Re: [R] svy / weighted regression
On Oct 13, 2009, at 6:07 AM, Laust wrote: Dear David, Thanks again for your input! I realize that I did a bad job of explaining this in my first email, but the setup is that in Finland persons who die are sampled with a different probability (1) from those who live (.5). This was done by the Finnish data protection authorities to protect individuals against identification. In the rest of the countries everyone is sampled with a probability of 1. The data that I am supplying to R is summarized data for each country stratified by case status. Another way of organizing the data would be: # creating data listc - c(Denmark,Finland,Norway,Sweden) listw - c(1,2,1,1) listd - c(1000,1000,1000,2000) listt - c(755000,505000,905000,191) list.cwdt - c(listc, listw, listd, listt) country2 - data.frame(country=listc,weight=listw,deaths=listd,time=listt) I hope that it is clearer now that for no value of the independent variable 'country' is the rate going to be zero. It is clearer now, and I think you were correct in believing that should not have been the problem, so please accept my apologies. The denominators and numerators should have been properly summed prior to estimation. I think this was also not the case in my original example, but this was obscured by my poor communication- R-skills. But if data is organized this way then sampling weight of 2 for Finland should only be applied to the time-variable that contains person years at risk and *not* to the number of deaths, which would complicate matters further. I would know how to get this to work in R or in any other statistical package. Perhaps it is - as Peter Dalgaard suggested - the estimation of the dispersion parameter by the survey package that is causing trouble, not the data example eo ipso. Or perhaps I am just using survey in a wrong way. I think it is likely that we are now both using it incorrectly, but my efforts are also creating nonsense. From the help page I thought that the formula in svydesign might be need to be the country variable ... wrong. Or that the weights might need to be the inverse of what you had used ...wrong. Or that you ought to use quasipoisson for the family wrong again. Lumley is preparing a book to accompany the package but that is still several months away from release. He and Norm Breslow also published a paper very recently in the American Journal of Epidemiology on the using of survey sampling for analysis of case-cohort designs (of which your problem seems to be an exceedingly simple example, albeit only in one of the four strata.) I don't have access to the original paper at the moment, but perhaps you are in an academic setting where such access would be routine. Or probably even more efficient would be to shoot a letter to Thomas Lumley. -- David Best Laust Post doc. Laust Mortensen, PhD Epidemiology Unit University of Southern Denmark On Mon, Oct 12, 2009 at 3:32 PM, David Winsemius dwinsem...@comcast.net wrote: I think you are missing the point. You have 4 zero death counts associated with much higher person years of exposure followed by 4 death counts in the thousands associated with lower degrees of exposures. It seems unlikely that these are real data as there are not cohorts that would exhibit such lower death-rates. So it appears that in setting up your test case, you have created an impossibly unrealistic test problem. -- David On Oct 12, 2009, at 9:12 AM, Laust wrote: Dear Peter, Thanks for the input. The zero rates in some strata occurs because sampling depended on case status: In Finland only 50% of the non- cases were sampled, while all others were sampled with 100% probability. Best Laust On Sat, Oct 10, 2009 at 11:02 AM, Peter Dalgaard p.dalga...@biostat.ku.dk wrote: Sorry, forgot to reply all... Laust wrote: Dear list, I am trying to set up a propensity-weighted regression using the survey package. Most of my population is sampled with a sampling probability of one (that is, I have the full population). However, for a subset of the data I have only a 50% sample of the full population. In previous work on the data, I analyzed these data using SAS and STATA. In those packages I used a propensity weight of 1/[sampling probability] in various generalized linear regression- procedures, but I am having trouble setting this up. I bet the solution is simple, but I’m a R newbie. Code to illustrate my problem below. Hi Laust, You probably need the package author to explain fully, but as far as I can see, the crux is that a dispersion parameter is being used, based on Pearson residuals, even in the Poisson case (i.e. you effectively get the same result as with quasipoisson()). I don't know what the rationale is for this, but it is clear that with your data, an estimated dispersion parameter is going to be large. E.g. the data has both 0 cases in 75 person-years and 1000 cases in
Re: [R] svy / weighted regression
Dear Peter, Thanks for the input. The zero rates in some strata occurs because sampling depended on case status: In Finland only 50% of the non-cases were sampled, while all others were sampled with 100% probability. Best Laust On Sat, Oct 10, 2009 at 11:02 AM, Peter Dalgaard p.dalga...@biostat.ku.dk wrote: Sorry, forgot to reply all... Laust wrote: Dear list, I am trying to set up a propensity-weighted regression using the survey package. Most of my population is sampled with a sampling probability of one (that is, I have the full population). However, for a subset of the data I have only a 50% sample of the full population. In previous work on the data, I analyzed these data using SAS and STATA. In those packages I used a propensity weight of 1/[sampling probability] in various generalized linear regression-procedures, but I am having trouble setting this up. I bet the solution is simple, but I’m a R newbie. Code to illustrate my problem below. Hi Laust, You probably need the package author to explain fully, but as far as I can see, the crux is that a dispersion parameter is being used, based on Pearson residuals, even in the Poisson case (i.e. you effectively get the same result as with quasipoisson()). I don't know what the rationale is for this, but it is clear that with your data, an estimated dispersion parameter is going to be large. E.g. the data has both 0 cases in 75 person-years and 1000 cases in 5000 person-years for Denmark, and in your model they are supposed to have the same Poisson rate. summary.svyglm starts off with est.disp - TRUE and AFAICS there is no way it can get set to FALSE. Knowing Thomas, there is probably a perfectly good reason not to just set the dispersion to 1, but I don't get it either... Thanks Laust # loading survey library(survey) # creating data listc - c(Denmark,Finland,Norway,Sweden,Denmark,Finland,Norway,Sweden) listw - c(1,2,1,1,1,1,1,1) listd - c(0,0,0,0,1000,1000,1000,2000) listt - c(75,50,90,190,5000,5000,5000,1) list.cwdt - c(listc, listw, listd, listt) country - data.frame(country=listc,weight=listw,deaths=listd,yrs_at_risk=listt) # running a frequency weighted regression to get the correct point estimates for comparison glm - glm(deaths ~ country + offset(log(yrs_at_risk)), weights=weight, data=country, family=poisson()) summary(glm) regTermTest(glm, ~ country) # running survey weighted regression svy - svydesign(~0,,data=country, weight=~weight) svyglm - svyglm(deaths ~ country + offset(log(yrs_at_risk)), design=svy, data=country, family=poisson()) summary(svyglm) # point estimates are correct, but standard error is way too large regTermTest(svyglm, ~ country) # test indicates no country differences __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- O__ Peter Dalgaard Øster Farimagsgade 5, Entr.B c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K (*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918 ~~ - (p.dalga...@biostat.ku.dk) FAX: (+45) 35327907 __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] svy / weighted regression
I think you are missing the point. You have 4 zero death counts associated with much higher person years of exposure followed by 4 death counts in the thousands associated with lower degrees of exposures. It seems unlikely that these are real data as there are not cohorts that would exhibit such lower death-rates. So it appears that in setting up your test case, you have created an impossibly unrealistic test problem. -- David On Oct 12, 2009, at 9:12 AM, Laust wrote: Dear Peter, Thanks for the input. The zero rates in some strata occurs because sampling depended on case status: In Finland only 50% of the non-cases were sampled, while all others were sampled with 100% probability. Best Laust On Sat, Oct 10, 2009 at 11:02 AM, Peter Dalgaard p.dalga...@biostat.ku.dk wrote: Sorry, forgot to reply all... Laust wrote: Dear list, I am trying to set up a propensity-weighted regression using the survey package. Most of my population is sampled with a sampling probability of one (that is, I have the full population). However, for a subset of the data I have only a 50% sample of the full population. In previous work on the data, I analyzed these data using SAS and STATA. In those packages I used a propensity weight of 1/[sampling probability] in various generalized linear regression-procedures, but I am having trouble setting this up. I bet the solution is simple, but I’m a R newbie. Code to illustrate my problem below. Hi Laust, You probably need the package author to explain fully, but as far as I can see, the crux is that a dispersion parameter is being used, based on Pearson residuals, even in the Poisson case (i.e. you effectively get the same result as with quasipoisson()). I don't know what the rationale is for this, but it is clear that with your data, an estimated dispersion parameter is going to be large. E.g. the data has both 0 cases in 75 person-years and 1000 cases in 5000 person-years for Denmark, and in your model they are supposed to have the same Poisson rate. summary.svyglm starts off with est.disp - TRUE and AFAICS there is no way it can get set to FALSE. Knowing Thomas, there is probably a perfectly good reason not to just set the dispersion to 1, but I don't get it either... Thanks Laust # loading survey library(survey) # creating data listc - c (Denmark ,Finland,Norway,Sweden,Denmark,Finland,Norway,Sweden) listw - c(1,2,1,1,1,1,1,1) listd - c(0,0,0,0,1000,1000,1000,2000) listt - c(75,50,90,190,5000,5000,5000,1) list.cwdt - c(listc, listw, listd, listt) country - data .frame(country=listc,weight=listw,deaths=listd,yrs_at_risk=listt) # running a frequency weighted regression to get the correct point estimates for comparison glm - glm(deaths ~ country + offset(log(yrs_at_risk)), weights=weight, data=country, family=poisson()) summary(glm) regTermTest(glm, ~ country) # running survey weighted regression svy - svydesign(~0,,data=country, weight=~weight) svyglm - svyglm(deaths ~ country + offset(log(yrs_at_risk)), design=svy, data=country, family=poisson()) summary(svyglm) # point estimates are correct, but standard error is way too large regTermTest(svyglm, ~ country) # test indicates no country differences __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- O__ Peter Dalgaard Øster Farimagsgade 5, Entr.B c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K (*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918 ~~ - (p.dalga...@biostat.ku.dk) FAX: (+45) 35327907 __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. David Winsemius, MD Heritage Laboratories West Hartford, CT __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] svy / weighted regression
Sorry, forgot to reply all... Laust wrote: Dear list, I am trying to set up a propensity-weighted regression using the survey package. Most of my population is sampled with a sampling probability of one (that is, I have the full population). However, for a subset of the data I have only a 50% sample of the full population. In previous work on the data, I analyzed these data using SAS and STATA. In those packages I used a propensity weight of 1/[sampling probability] in various generalized linear regression-procedures, but I am having trouble setting this up. I bet the solution is simple, but I’m a R newbie. Code to illustrate my problem below. Hi Laust, You probably need the package author to explain fully, but as far as I can see, the crux is that a dispersion parameter is being used, based on Pearson residuals, even in the Poisson case (i.e. you effectively get the same result as with quasipoisson()). I don't know what the rationale is for this, but it is clear that with your data, an estimated dispersion parameter is going to be large. E.g. the data has both 0 cases in 75 person-years and 1000 cases in 5000 person-years for Denmark, and in your model they are supposed to have the same Poisson rate. summary.svyglm starts off with est.disp - TRUE and AFAICS there is no way it can get set to FALSE. Knowing Thomas, there is probably a perfectly good reason not to just set the dispersion to 1, but I don't get it either... Thanks Laust # loading survey library(survey) # creating data listc - c(Denmark,Finland,Norway,Sweden,Denmark,Finland,Norway,Sweden) listw - c(1,2,1,1,1,1,1,1) listd - c(0,0,0,0,1000,1000,1000,2000) listt - c(75,50,90,190,5000,5000,5000,1) list.cwdt - c(listc, listw, listd, listt) country - data.frame(country=listc,weight=listw,deaths=listd,yrs_at_risk=listt) # running a frequency weighted regression to get the correct point estimates for comparison glm - glm(deaths ~ country + offset(log(yrs_at_risk)), weights=weight, data=country, family=poisson()) summary(glm) regTermTest(glm, ~ country) # running survey weighted regression svy - svydesign(~0,,data=country, weight=~weight) svyglm - svyglm(deaths ~ country + offset(log(yrs_at_risk)), design=svy, data=country, family=poisson()) summary(svyglm) # point estimates are correct, but standard error is way too large regTermTest(svyglm, ~ country) # test indicates no country differences __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. -- O__ Peter Dalgaard Øster Farimagsgade 5, Entr.B c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K (*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918 ~~ - (p.dalga...@biostat.ku.dk) FAX: (+45) 35327907 __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] svy / weighted regression
Dear list, I am trying to set up a propensity-weighted regression using the survey package. Most of my population is sampled with a sampling probability of one (that is, I have the full population). However, for a subset of the data I have only a 50% sample of the full population. In previous work on the data, I analyzed these data using SAS and STATA. In those packages I used a propensity weight of 1/[sampling probability] in various generalized linear regression-procedures, but I am having trouble setting this up. I bet the solution is simple, but I’m a R newbie. Code to illustrate my problem below. Thanks Laust # loading survey library(survey) # creating data listc - c(Denmark,Finland,Norway,Sweden,Denmark,Finland,Norway,Sweden) listw - c(1,2,1,1,1,1,1,1) listd - c(0,0,0,0,1000,1000,1000,2000) listt - c(75,50,90,190,5000,5000,5000,1) list.cwdt - c(listc, listw, listd, listt) country - data.frame(country=listc,weight=listw,deaths=listd,yrs_at_risk=listt) # running a frequency weighted regression to get the correct point estimates for comparison glm - glm(deaths ~ country + offset(log(yrs_at_risk)), weights=weight, data=country, family=poisson()) summary(glm) regTermTest(glm, ~ country) # running survey weighted regression svy - svydesign(~0,,data=country, weight=~weight) svyglm - svyglm(deaths ~ country + offset(log(yrs_at_risk)), design=svy, data=country, family=poisson()) summary(svyglm) # point estimates are correct, but standard error is way too large regTermTest(svyglm, ~ country) # test indicates no country differences __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.