Re: [R] svy / weighted regression

2009-10-14 Thread Laust
Dear Thomas  David

That makes sense! If I wanted to use survey on the summarized data, I
suppose that I could 'de-summarize' or 're-individualize' the data to
give the design object the correct information on the number of
observations. Or I could revert to using the actual individual-level
data.

Thanks a lot, your input has been very helpful.
Laust



Post doc. Laust Mortensen, PhD
Epidemiology Unit
University of Southern Denmark

2009/10/13 Thomas Lumley tlum...@u.washington.edu:

 I think there is a much simpler explanation.

 The survey design object has eight observations, two per country.   With a 
 sample size of two per country it is hardly surprising that country-specific 
 estimates are not very precise. The actual data has hundreds of thousands  of 
 observations per country, so it will have more precise estimates.

 Grouping the data doesn't make a difference for model-based glm estimation, 
 where it is simply a computational convenience. It *does* make a difference 
 for design-based estimation, because it changes the design.

         -thomas


 On Tue, 13 Oct 2009, Laust wrote:

 Dear David,

 Thanks again for your input! I realize that I did a bad job of
 explaining this in my first email, but the setup is that in Finland
 persons who die are sampled with a different probability (1) from
 those who live (.5). This was done by the Finnish data protection
 authorities to protect individuals against identification. In the rest
 of the countries everyone is sampled with a probability of 1. The data
 that I am supplying to R is summarized data for each country
 stratified by case status. Another way of organizing the data would
 be:

 # creating data
 listc - c(Denmark,Finland,Norway,Sweden)
 listw - c(1,2,1,1)
 listd - c(1000,1000,1000,2000)
 listt - c(755000,505000,905000,191)
 list.cwdt - c(listc, listw, listd, listt)
 country2 - data.frame(country=listc,weight=listw,deaths=listd,time=listt)

 I hope that it is clearer now that for no value of the independent
 variable 'country' is the rate going to be zero. I think this was also
 not the case in my original example, but this was obscured by my poor
 communication-  R-skills. But if data is organized this way then
 sampling weight of 2 for Finland should only be applied to the
 time-variable that contains person years at risk and *not* to the
 number of deaths, which would complicate matters further. I would know
 how to get this to work in R or in any other statistical package.
 Perhaps it is - as Peter Dalgaard suggested - the estimation of the
 dispersion parameter by the survey package that is causing trouble,
 not the data example eo ipso. Or perhaps I am just using survey in a
 wrong way.

 Best
 Laust

 
 Post doc. Laust Mortensen, PhD
 Epidemiology Unit
 University of Southern Denmark

 On Mon, Oct 12, 2009 at 3:32 PM, David Winsemius dwinsem...@comcast.net 
 wrote:

 I think you are missing the point. You have 4 zero death counts associated
 with much higher person years of exposure followed by 4 death counts in the
 thousands associated with lower degrees of exposures. It seems unlikely that
 these are real data as there are not cohorts that would exhibit such lower
 death-rates. So it appears that in setting up your test case, you have
 created an impossibly unrealistic test problem.

 --
 David


 On Oct 12, 2009, at 9:12 AM, Laust wrote:

 Dear Peter,

 Thanks for the input. The zero rates in some strata occurs because
 sampling depended on case status: In Finland only 50% of the non-cases
 were sampled, while all others were sampled with 100% probability.

 Best
 Laust

 On Sat, Oct 10, 2009 at 11:02 AM, Peter Dalgaard
 p.dalga...@biostat.ku.dk wrote:

 Sorry, forgot to reply all...

 Laust wrote:

 Dear list,

 I am trying to set up a propensity-weighted regression using the
 survey package. Most of my population is sampled with a sampling
 probability of one (that is, I have the full population). However, for
 a subset of the data I have only a 50% sample of the full population.
 In previous work on the data, I analyzed these data using SAS and
 STATA. In those packages I used a propensity weight of 1/[sampling
 probability] in various generalized linear regression-procedures, but
 I am having trouble setting this up. I bet the solution is simple, but
 I’m a R newbie. Code to illustrate my problem below.

 Hi Laust,

 You probably need the package author to explain fully, but as far as I
 can see, the crux is that a dispersion parameter is being used, based on
 Pearson residuals, even in the Poisson case (i.e. you effectively get
 the same result as with quasipoisson()).

 I don't know what the rationale is for this, but it is clear that with
 your data, an estimated dispersion parameter is going to be large. E.g.
 the data has both 0 cases in 75 person-years and 1000 cases in 5000
 person-years for Denmark, and in your model they are supposed to have
 the same Poisson rate.

 summary.svyglm starts off with

Re: [R] svy / weighted regression

2009-10-13 Thread Laust
Dear David,

Thanks again for your input! I realize that I did a bad job of
explaining this in my first email, but the setup is that in Finland
persons who die are sampled with a different probability (1) from
those who live (.5). This was done by the Finnish data protection
authorities to protect individuals against identification. In the rest
of the countries everyone is sampled with a probability of 1. The data
that I am supplying to R is summarized data for each country
stratified by case status. Another way of organizing the data would
be:

# creating data
listc - c(Denmark,Finland,Norway,Sweden)
listw - c(1,2,1,1)
listd - c(1000,1000,1000,2000)
listt - c(755000,505000,905000,191)
list.cwdt - c(listc, listw, listd, listt)
country2 - data.frame(country=listc,weight=listw,deaths=listd,time=listt)

I hope that it is clearer now that for no value of the independent
variable 'country' is the rate going to be zero. I think this was also
not the case in my original example, but this was obscured by my poor
communication-  R-skills. But if data is organized this way then
sampling weight of 2 for Finland should only be applied to the
time-variable that contains person years at risk and *not* to the
number of deaths, which would complicate matters further. I would know
how to get this to work in R or in any other statistical package.
Perhaps it is - as Peter Dalgaard suggested - the estimation of the
dispersion parameter by the survey package that is causing trouble,
not the data example eo ipso. Or perhaps I am just using survey in a
wrong way.

Best
Laust


Post doc. Laust Mortensen, PhD
Epidemiology Unit
University of Southern Denmark

On Mon, Oct 12, 2009 at 3:32 PM, David Winsemius dwinsem...@comcast.net wrote:
 I think you are missing the point. You have 4 zero death counts associated
 with much higher person years of exposure followed by 4 death counts in the
 thousands associated with lower degrees of exposures. It seems unlikely that
 these are real data as there are not cohorts that would exhibit such lower
 death-rates. So it appears that in setting up your test case, you have
 created an impossibly unrealistic test problem.

 --
 David


 On Oct 12, 2009, at 9:12 AM, Laust wrote:

 Dear Peter,

 Thanks for the input. The zero rates in some strata occurs because
 sampling depended on case status: In Finland only 50% of the non-cases
 were sampled, while all others were sampled with 100% probability.

 Best
 Laust

 On Sat, Oct 10, 2009 at 11:02 AM, Peter Dalgaard
 p.dalga...@biostat.ku.dk wrote:

 Sorry, forgot to reply all...

 Laust wrote:

 Dear list,

 I am trying to set up a propensity-weighted regression using the
 survey package. Most of my population is sampled with a sampling
 probability of one (that is, I have the full population). However, for
 a subset of the data I have only a 50% sample of the full population.
 In previous work on the data, I analyzed these data using SAS and
 STATA. In those packages I used a propensity weight of 1/[sampling
 probability] in various generalized linear regression-procedures, but
 I am having trouble setting this up. I bet the solution is simple, but
 I’m a R newbie. Code to illustrate my problem below.

 Hi Laust,

 You probably need the package author to explain fully, but as far as I
 can see, the crux is that a dispersion parameter is being used, based on
 Pearson residuals, even in the Poisson case (i.e. you effectively get
 the same result as with quasipoisson()).

 I don't know what the rationale is for this, but it is clear that with
 your data, an estimated dispersion parameter is going to be large. E.g.
 the data has both 0 cases in 75 person-years and 1000 cases in 5000
 person-years for Denmark, and in your model they are supposed to have
 the same Poisson rate.

 summary.svyglm starts off with

   est.disp - TRUE

 and AFAICS there is no way it can get set to FALSE.  Knowing Thomas,
 there is probably a perfectly good reason not to just set the dispersion
 to 1, but I don't get it either...


 Thanks
 Laust

 # loading survey
 library(survey)

 # creating data
 listc -

 c(Denmark,Finland,Norway,Sweden,Denmark,Finland,Norway,Sweden)
 listw - c(1,2,1,1,1,1,1,1)
 listd - c(0,0,0,0,1000,1000,1000,2000)
 listt - c(75,50,90,190,5000,5000,5000,1)
 list.cwdt - c(listc, listw, listd, listt)
 country -
 data.frame(country=listc,weight=listw,deaths=listd,yrs_at_risk=listt)

 # running a frequency weighted regression to get the correct point
 estimates for comparison
 glm - glm(deaths ~ country + offset(log(yrs_at_risk)),
 weights=weight, data=country, family=poisson())
 summary(glm)
 regTermTest(glm, ~ country)

 # running survey weighted regression
 svy - svydesign(~0,,data=country, weight=~weight)
 svyglm - svyglm(deaths ~ country + offset(log(yrs_at_risk)),
 design=svy, data=country, family=poisson())
 summary(svyglm)
 # point estimates are correct, but standard error is way too large
 regTermTest(svyglm, ~ 

Re: [R] svy / weighted regression

2009-10-13 Thread Thomas Lumley


I think there is a much simpler explanation.

The survey design object has eight observations, two per country.   With a sample size of two per 
country it is hardly surprising that country-specific estimates are not very precise. The actual data has 
hundreds of thousands  of observations per country, so it will have more precise estimates.


Grouping the data doesn't make a difference for model-based glm estimation, where it is simply a 
computational convenience. It *does* make a difference for design-based estimation, because it 
changes the design.


 -thomas


On Tue, 13 Oct 2009, Laust wrote:


Dear David,

Thanks again for your input! I realize that I did a bad job of
explaining this in my first email, but the setup is that in Finland
persons who die are sampled with a different probability (1) from
those who live (.5). This was done by the Finnish data protection
authorities to protect individuals against identification. In the rest
of the countries everyone is sampled with a probability of 1. The data
that I am supplying to R is summarized data for each country
stratified by case status. Another way of organizing the data would
be:

# creating data
listc - c(Denmark,Finland,Norway,Sweden)
listw - c(1,2,1,1)
listd - c(1000,1000,1000,2000)
listt - c(755000,505000,905000,191)
list.cwdt - c(listc, listw, listd, listt)
country2 - data.frame(country=listc,weight=listw,deaths=listd,time=listt)

I hope that it is clearer now that for no value of the independent
variable 'country' is the rate going to be zero. I think this was also
not the case in my original example, but this was obscured by my poor
communication-  R-skills. But if data is organized this way then
sampling weight of 2 for Finland should only be applied to the
time-variable that contains person years at risk and *not* to the
number of deaths, which would complicate matters further. I would know
how to get this to work in R or in any other statistical package.
Perhaps it is - as Peter Dalgaard suggested - the estimation of the
dispersion parameter by the survey package that is causing trouble,
not the data example eo ipso. Or perhaps I am just using survey in a
wrong way.

Best
Laust


Post doc. Laust Mortensen, PhD
Epidemiology Unit
University of Southern Denmark

On Mon, Oct 12, 2009 at 3:32 PM, David Winsemius dwinsem...@comcast.net wrote:

I think you are missing the point. You have 4 zero death counts associated
with much higher person years of exposure followed by 4 death counts in the
thousands associated with lower degrees of exposures. It seems unlikely that
these are real data as there are not cohorts that would exhibit such lower
death-rates. So it appears that in setting up your test case, you have
created an impossibly unrealistic test problem.

--
David


On Oct 12, 2009, at 9:12 AM, Laust wrote:


Dear Peter,

Thanks for the input. The zero rates in some strata occurs because
sampling depended on case status: In Finland only 50% of the non-cases
were sampled, while all others were sampled with 100% probability.

Best
Laust

On Sat, Oct 10, 2009 at 11:02 AM, Peter Dalgaard
p.dalga...@biostat.ku.dk wrote:


Sorry, forgot to reply all...

Laust wrote:


Dear list,

I am trying to set up a propensity-weighted regression using the
survey package. Most of my population is sampled with a sampling
probability of one (that is, I have the full population). However, for
a subset of the data I have only a 50% sample of the full population.
In previous work on the data, I analyzed these data using SAS and
STATA. In those packages I used a propensity weight of 1/[sampling
probability] in various generalized linear regression-procedures, but
I am having trouble setting this up. I bet the solution is simple, but
I’m a R newbie. Code to illustrate my problem below.


Hi Laust,

You probably need the package author to explain fully, but as far as I
can see, the crux is that a dispersion parameter is being used, based on
Pearson residuals, even in the Poisson case (i.e. you effectively get
the same result as with quasipoisson()).

I don't know what the rationale is for this, but it is clear that with
your data, an estimated dispersion parameter is going to be large. E.g.
the data has both 0 cases in 75 person-years and 1000 cases in 5000
person-years for Denmark, and in your model they are supposed to have
the same Poisson rate.

summary.svyglm starts off with

  est.disp - TRUE

and AFAICS there is no way it can get set to FALSE.  Knowing Thomas,
there is probably a perfectly good reason not to just set the dispersion
to 1, but I don't get it either...



Thanks
Laust

# loading survey
library(survey)

# creating data
listc -

c(Denmark,Finland,Norway,Sweden,Denmark,Finland,Norway,Sweden)
listw - c(1,2,1,1,1,1,1,1)
listd - c(0,0,0,0,1000,1000,1000,2000)
listt - c(75,50,90,190,5000,5000,5000,1)
list.cwdt - c(listc, listw, listd, listt)
country -

Re: [R] svy / weighted regression

2009-10-13 Thread David Winsemius


On Oct 13, 2009, at 6:07 AM, Laust wrote:


Dear David,

Thanks again for your input! I realize that I did a bad job of
explaining this in my first email, but the setup is that in Finland
persons who die are sampled with a different probability (1) from
those who live (.5). This was done by the Finnish data protection
authorities to protect individuals against identification. In the rest
of the countries everyone is sampled with a probability of 1. The data
that I am supplying to R is summarized data for each country
stratified by case status. Another way of organizing the data would
be:

# creating data
listc - c(Denmark,Finland,Norway,Sweden)
listw - c(1,2,1,1)
listd - c(1000,1000,1000,2000)
listt - c(755000,505000,905000,191)
list.cwdt - c(listc, listw, listd, listt)
country2 -  
data.frame(country=listc,weight=listw,deaths=listd,time=listt)


I hope that it is clearer now that for no value of the independent
variable 'country' is the rate going to be zero.


It is clearer now, and I think you were correct in believing that  
should not have been the problem, so please accept my apologies. The  
denominators and numerators should have been properly summed prior to  
estimation.



I think this was also
not the case in my original example, but this was obscured by my poor
communication-  R-skills. But if data is organized this way then
sampling weight of 2 for Finland should only be applied to the
time-variable that contains person years at risk and *not* to the
number of deaths, which would complicate matters further. I would know
how to get this to work in R or in any other statistical package.




Perhaps it is - as Peter Dalgaard suggested - the estimation of the
dispersion parameter by the survey package that is causing trouble,
not the data example eo ipso. Or perhaps I am just using survey in a
wrong way.


I think it is likely that we are now both using it incorrectly, but my  
efforts are also creating nonsense. From the help page I thought that  
the formula in svydesign might be need to be the country variable ...  
wrong. Or that the weights might need to be the inverse of what you  
had used ...wrong. Or that you ought to use quasipoisson for the  
family  wrong again.


Lumley is preparing a book to accompany the package but that is still  
several months away from release. He and Norm Breslow also published a  
paper very recently in the American Journal of Epidemiology on the  
using of survey sampling for analysis of case-cohort designs (of which  
your problem seems to be an exceedingly simple example, albeit only in  
one of the four strata.) I don't have access to the original paper at  
the moment, but perhaps you are in an academic setting where such  
access would be routine.


Or probably even more efficient would be to shoot a letter to Thomas  
Lumley.


--
David



Best
Laust


Post doc. Laust Mortensen, PhD
Epidemiology Unit
University of Southern Denmark

On Mon, Oct 12, 2009 at 3:32 PM, David Winsemius dwinsem...@comcast.net 
 wrote:
I think you are missing the point. You have 4 zero death counts  
associated
with much higher person years of exposure followed by 4 death  
counts in the
thousands associated with lower degrees of exposures. It seems  
unlikely that
these are real data as there are not cohorts that would exhibit  
such lower
death-rates. So it appears that in setting up your test case, you  
have

created an impossibly unrealistic test problem.

--
David


On Oct 12, 2009, at 9:12 AM, Laust wrote:


Dear Peter,

Thanks for the input. The zero rates in some strata occurs because
sampling depended on case status: In Finland only 50% of the non- 
cases

were sampled, while all others were sampled with 100% probability.

Best
Laust

On Sat, Oct 10, 2009 at 11:02 AM, Peter Dalgaard
p.dalga...@biostat.ku.dk wrote:


Sorry, forgot to reply all...

Laust wrote:


Dear list,

I am trying to set up a propensity-weighted regression using the
survey package. Most of my population is sampled with a sampling
probability of one (that is, I have the full population).  
However, for
a subset of the data I have only a 50% sample of the full  
population.

In previous work on the data, I analyzed these data using SAS and
STATA. In those packages I used a propensity weight of 1/[sampling
probability] in various generalized linear regression- 
procedures, but
I am having trouble setting this up. I bet the solution is  
simple, but

I’m a R newbie. Code to illustrate my problem below.


Hi Laust,

You probably need the package author to explain fully, but as far  
as I
can see, the crux is that a dispersion parameter is being used,  
based on
Pearson residuals, even in the Poisson case (i.e. you effectively  
get

the same result as with quasipoisson()).

I don't know what the rationale is for this, but it is clear that  
with
your data, an estimated dispersion parameter is going to be  
large. E.g.
the data has both 0 cases in 75 person-years and 1000 cases  
in 

Re: [R] svy / weighted regression

2009-10-12 Thread Laust
Dear Peter,

Thanks for the input. The zero rates in some strata occurs because
sampling depended on case status: In Finland only 50% of the non-cases
were sampled, while all others were sampled with 100% probability.

Best
Laust

On Sat, Oct 10, 2009 at 11:02 AM, Peter Dalgaard
p.dalga...@biostat.ku.dk wrote:
 Sorry, forgot to reply all...

 Laust wrote:

 Dear list,

 I am trying to set up a propensity-weighted regression using the
 survey package. Most of my population is sampled with a sampling
 probability of one (that is, I have the full population). However, for
 a subset of the data I have only a 50% sample of the full population.
 In previous work on the data, I analyzed these data using SAS and
 STATA. In those packages I used a propensity weight of 1/[sampling
 probability] in various generalized linear regression-procedures, but
 I am having trouble setting this up. I bet the solution is simple, but
 I’m a R newbie. Code to illustrate my problem below.

 Hi Laust,

 You probably need the package author to explain fully, but as far as I
 can see, the crux is that a dispersion parameter is being used, based on
 Pearson residuals, even in the Poisson case (i.e. you effectively get
 the same result as with quasipoisson()).

 I don't know what the rationale is for this, but it is clear that with
 your data, an estimated dispersion parameter is going to be large. E.g.
 the data has both 0 cases in 75 person-years and 1000 cases in 5000
 person-years for Denmark, and in your model they are supposed to have
 the same Poisson rate.

 summary.svyglm starts off with

    est.disp - TRUE

 and AFAICS there is no way it can get set to FALSE.  Knowing Thomas,
 there is probably a perfectly good reason not to just set the dispersion
 to 1, but I don't get it either...


 Thanks
 Laust

 # loading survey
 library(survey)

 # creating data
 listc -
 c(Denmark,Finland,Norway,Sweden,Denmark,Finland,Norway,Sweden)
 listw - c(1,2,1,1,1,1,1,1)
 listd - c(0,0,0,0,1000,1000,1000,2000)
 listt - c(75,50,90,190,5000,5000,5000,1)
 list.cwdt - c(listc, listw, listd, listt)
 country -
 data.frame(country=listc,weight=listw,deaths=listd,yrs_at_risk=listt)

 # running a frequency weighted regression to get the correct point
 estimates for comparison
 glm - glm(deaths ~ country + offset(log(yrs_at_risk)),
 weights=weight, data=country, family=poisson())
 summary(glm)
 regTermTest(glm, ~ country)

 # running survey weighted regression
 svy - svydesign(~0,,data=country, weight=~weight)
 svyglm - svyglm(deaths ~ country + offset(log(yrs_at_risk)),
 design=svy, data=country, family=poisson())
 summary(svyglm)
 # point estimates are correct, but standard error is way too large
 regTermTest(svyglm, ~ country)
 # test indicates no country differences

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.


 --
   O__   Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
  (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
 ~~ - (p.dalga...@biostat.ku.dk)              FAX: (+45) 35327907



__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] svy / weighted regression

2009-10-12 Thread David Winsemius
I think you are missing the point. You have 4 zero death counts  
associated with much higher person years of exposure followed by 4  
death counts in the thousands associated with lower degrees of  
exposures. It seems unlikely that these are real data as there are not  
cohorts that would exhibit such lower death-rates. So it appears that  
in setting up your test case, you have created an impossibly  
unrealistic test problem.


--
David


On Oct 12, 2009, at 9:12 AM, Laust wrote:


Dear Peter,

Thanks for the input. The zero rates in some strata occurs because
sampling depended on case status: In Finland only 50% of the non-cases
were sampled, while all others were sampled with 100% probability.

Best
Laust

On Sat, Oct 10, 2009 at 11:02 AM, Peter Dalgaard
p.dalga...@biostat.ku.dk wrote:

Sorry, forgot to reply all...

Laust wrote:


Dear list,

I am trying to set up a propensity-weighted regression using the
survey package. Most of my population is sampled with a sampling
probability of one (that is, I have the full population). However,  
for
a subset of the data I have only a 50% sample of the full  
population.

In previous work on the data, I analyzed these data using SAS and
STATA. In those packages I used a propensity weight of 1/[sampling
probability] in various generalized linear regression-procedures,  
but
I am having trouble setting this up. I bet the solution is simple,  
but

I’m a R newbie. Code to illustrate my problem below.


Hi Laust,

You probably need the package author to explain fully, but as far  
as I
can see, the crux is that a dispersion parameter is being used,  
based on

Pearson residuals, even in the Poisson case (i.e. you effectively get
the same result as with quasipoisson()).

I don't know what the rationale is for this, but it is clear that  
with
your data, an estimated dispersion parameter is going to be large.  
E.g.
the data has both 0 cases in 75 person-years and 1000 cases in  
5000

person-years for Denmark, and in your model they are supposed to have
the same Poisson rate.

summary.svyglm starts off with

   est.disp - TRUE

and AFAICS there is no way it can get set to FALSE.  Knowing Thomas,
there is probably a perfectly good reason not to just set the  
dispersion

to 1, but I don't get it either...



Thanks
Laust

# loading survey
library(survey)

# creating data
listc -
c 
(Denmark 
,Finland,Norway,Sweden,Denmark,Finland,Norway,Sweden)

listw - c(1,2,1,1,1,1,1,1)
listd - c(0,0,0,0,1000,1000,1000,2000)
listt - c(75,50,90,190,5000,5000,5000,1)
list.cwdt - c(listc, listw, listd, listt)
country -
data 
.frame(country=listc,weight=listw,deaths=listd,yrs_at_risk=listt)


# running a frequency weighted regression to get the correct point
estimates for comparison
glm - glm(deaths ~ country + offset(log(yrs_at_risk)),
weights=weight, data=country, family=poisson())
summary(glm)
regTermTest(glm, ~ country)

# running survey weighted regression
svy - svydesign(~0,,data=country, weight=~weight)
svyglm - svyglm(deaths ~ country + offset(log(yrs_at_risk)),
design=svy, data=country, family=poisson())
summary(svyglm)
# point estimates are correct, but standard error is way too large
regTermTest(svyglm, ~ country)
# test indicates no country differences

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



--
  O__   Peter Dalgaard Øster Farimagsgade 5, Entr.B
 c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark  Ph:  (+45)  
35327918
~~ - (p.dalga...@biostat.ku.dk)  FAX: (+45)  
35327907





__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


David Winsemius, MD
Heritage Laboratories
West Hartford, CT

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] svy / weighted regression

2009-10-10 Thread Peter Dalgaard

Sorry, forgot to reply all...

Laust wrote:

Dear list,

I am trying to set up a propensity-weighted regression using the
survey package. Most of my population is sampled with a sampling
probability of one (that is, I have the full population). However, for
a subset of the data I have only a 50% sample of the full population.
In previous work on the data, I analyzed these data using SAS and
STATA. In those packages I used a propensity weight of 1/[sampling
probability] in various generalized linear regression-procedures, but
I am having trouble setting this up. I bet the solution is simple, but
I’m a R newbie. Code to illustrate my problem below.


Hi Laust,

You probably need the package author to explain fully, but as far as I
can see, the crux is that a dispersion parameter is being used, based on
Pearson residuals, even in the Poisson case (i.e. you effectively get
the same result as with quasipoisson()).

I don't know what the rationale is for this, but it is clear that with
your data, an estimated dispersion parameter is going to be large. E.g.
the data has both 0 cases in 75 person-years and 1000 cases in 5000
person-years for Denmark, and in your model they are supposed to have
the same Poisson rate.

summary.svyglm starts off with

est.disp - TRUE

and AFAICS there is no way it can get set to FALSE.  Knowing Thomas,
there is probably a perfectly good reason not to just set the dispersion
to 1, but I don't get it either...



Thanks
Laust

# loading survey
library(survey)

# creating data
listc - 
c(Denmark,Finland,Norway,Sweden,Denmark,Finland,Norway,Sweden)
listw - c(1,2,1,1,1,1,1,1)
listd - c(0,0,0,0,1000,1000,1000,2000)
listt - c(75,50,90,190,5000,5000,5000,1)
list.cwdt - c(listc, listw, listd, listt)
country - data.frame(country=listc,weight=listw,deaths=listd,yrs_at_risk=listt)

# running a frequency weighted regression to get the correct point
estimates for comparison
glm - glm(deaths ~ country + offset(log(yrs_at_risk)),
weights=weight, data=country, family=poisson())
summary(glm)
regTermTest(glm, ~ country)

# running survey weighted regression
svy - svydesign(~0,,data=country, weight=~weight)
svyglm - svyglm(deaths ~ country + offset(log(yrs_at_risk)),
design=svy, data=country, family=poisson())
summary(svyglm)
# point estimates are correct, but standard error is way too large
regTermTest(svyglm, ~ country)
# test indicates no country differences

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



--
   O__   Peter Dalgaard Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark  Ph:  (+45) 35327918
~~ - (p.dalga...@biostat.ku.dk)  FAX: (+45) 35327907

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] svy / weighted regression

2009-10-09 Thread Laust
Dear list,

I am trying to set up a propensity-weighted regression using the
survey package. Most of my population is sampled with a sampling
probability of one (that is, I have the full population). However, for
a subset of the data I have only a 50% sample of the full population.
In previous work on the data, I analyzed these data using SAS and
STATA. In those packages I used a propensity weight of 1/[sampling
probability] in various generalized linear regression-procedures, but
I am having trouble setting this up. I bet the solution is simple, but
I’m a R newbie. Code to illustrate my problem below.

Thanks
Laust

# loading survey
library(survey)

# creating data
listc - 
c(Denmark,Finland,Norway,Sweden,Denmark,Finland,Norway,Sweden)
listw - c(1,2,1,1,1,1,1,1)
listd - c(0,0,0,0,1000,1000,1000,2000)
listt - c(75,50,90,190,5000,5000,5000,1)
list.cwdt - c(listc, listw, listd, listt)
country - data.frame(country=listc,weight=listw,deaths=listd,yrs_at_risk=listt)

# running a frequency weighted regression to get the correct point
estimates for comparison
glm - glm(deaths ~ country + offset(log(yrs_at_risk)),
weights=weight, data=country, family=poisson())
summary(glm)
regTermTest(glm, ~ country)

# running survey weighted regression
svy - svydesign(~0,,data=country, weight=~weight)
svyglm - svyglm(deaths ~ country + offset(log(yrs_at_risk)),
design=svy, data=country, family=poisson())
summary(svyglm)
# point estimates are correct, but standard error is way too large
regTermTest(svyglm, ~ country)
# test indicates no country differences

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.