Thanks John, that's just the explanation I was looking for. I had hoped that there would be a built-in way of dealing with case weights in R, but obviously not.
Given that explanation, it still seems to me that the way R calculates n is suboptimal, as demonstrated by my second example:

summary(lm(y ~ x, data=df, weights=rep(c(0,2), each=50)))
summary(lm(y ~ x, data=df, weights=rep(c(0.01,2), each=50)))

The weights are only very slightly different, but the estimates of residual standard error are quite different (20 vs 14 in my run).

Hadley

On 5/8/07, John Fox <[EMAIL PROTECTED]> wrote:
> Dear Hadley,
>
> I think that the problem is that the term "weights" has different meanings,
> which, although they are related, are not quite the same.
>
> The weights used by lm() are (inverse-)"variance weights," reflecting the
> variances of the errors, with observations that have low-variance errors
> therefore being accorded greater weight in the resulting WLS regression.
> What you have are sometimes called "case weights," and I'm unaware of a
> general way of handling them in R, although you could regenerate the
> unaggregated data. As you discovered, you get the same coefficients with
> case weights as with variance weights, but different standard errors.
> Finally, there are "sampling weights," which are inversely proportional to
> the probability of selection; these are accommodated by the survey package.
>
> To complicate matters, this terminology isn't entirely standard.
>
> I hope this helps,
> John
>
> --------------------------------
> John Fox, Professor
> Department of Sociology
> McMaster University
> Hamilton, Ontario
> Canada L8S 4M4
> 905-525-9140x23604
> http://socserv.mcmaster.ca/jfox
> --------------------------------
>
> > -----Original Message-----
> > From: [EMAIL PROTECTED]
> > [mailto:[EMAIL PROTECTED] On Behalf Of hadley wickham
> > Sent: Tuesday, May 08, 2007 5:09 AM
> > To: R Help
> > Subject: [R] Weighted least squares
> >
> > Dear all,
> >
> > I'm struggling with weighted least squares, where something
> > that I had assumed to be true appears not to be the case.
> > Take the following data set as an example:
> >
> > df <- data.frame(x = runif(100, 0, 100))
> > df$y <- df$x + 1 + rnorm(100, sd=15)
> >
> > I had expected that:
> >
> > summary(lm(y ~ x, data=df, weights=rep(2, 100)))
> > summary(lm(y ~ x, data=rbind(df,df)))
> >
> > would be equivalent, but they are not. I suspect the
> > difference is how the degrees of freedom is calculated - I
> > had expected it to be sum(weights), but seems to be
> > sum(weights > 0). This seems unintuitive to me:
> >
> > summary(lm(y ~ x, data=df, weights=rep(c(0,2), each=50)))
> > summary(lm(y ~ x, data=df, weights=rep(c(0.01,2), each=50)))
> >
> > What am I missing? And what is the usual way to do a linear
> > regression when you have aggregated data?
> >
> > Thanks,
> >
> > Hadley
> >
> > ______________________________________________
> > [email protected] mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
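
For anyone reading this thread later, here is a minimal sketch of the two workarounds for case (frequency) weights that the discussion points toward. It is an illustration under stated assumptions, not a built-in R facility: the weights are assumed to be positive integers, and the names w, expanded, fit_expanded, fit_weighted, df_ratio and se_case are made up for this example, as is the seed. The first option regenerates the unaggregated data by repeating rows; the second keeps the weighted fit and rescales its covariance matrix so that the residual degrees of freedom are sum(w) - p rather than sum(w > 0) - p.

```r
# Sketch only: treat integer weights as case (frequency) weights.
set.seed(1)  # arbitrary seed, added here for reproducibility
df <- data.frame(x = runif(100, 0, 100))
df$y <- df$x + 1 + rnorm(100, sd = 15)
df$w <- rep(2, 100)  # hypothetical column: each row stands for two cases

# Option 1: regenerate the unaggregated data by repeating each row w times,
# then fit an ordinary unweighted regression.
expanded <- df[rep(seq_len(nrow(df)), times = df$w), ]
fit_expanded <- lm(y ~ x, data = expanded)

# Option 2: fit with weights, then rescale the covariance matrix.
# lm() divides the weighted residual sum of squares by sum(w > 0) - p;
# for case weights the divisor should be sum(w) - p instead.
fit_weighted <- lm(y ~ x, data = df, weights = w)
p <- length(coef(fit_weighted))
df_ratio <- df.residual(fit_weighted) / (sum(df$w) - p)
se_case <- sqrt(diag(vcov(fit_weighted) * df_ratio))

# Standard errors for comparison.
se_expanded <- sqrt(diag(vcov(fit_expanded)))  # from the duplicated data
se_weighted <- sqrt(diag(vcov(fit_weighted)))  # as reported by summary()

print(rbind(expanded = se_expanded, rescaled = se_case, lm_weights = se_weighted))
```

The coefficients agree across all three fits; only the standard errors differ, and the rescaled ones should match those from the expanded data. Row expansion becomes impractical for large or non-integer weights, which is where the rescaling approach (or the survey package, for sampling weights) is the more natural choice.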
