On Jul 3, 2010, at 9:00 AM, David Winsemius wrote: > > On Jul 2, 2010, at 11:33 PM, Paulo Barata wrote: > >> >> Dear R-list members, >> >> I would like to pose a question about the use and results >> of the glm() function for logistic regression calculations. >> >> The question is based on an example provided on p. 229 >> in P. Dalgaard, Introductory Statistics with R, 2nd. edition, >> Springer, 2008. By means of this example, I was trying to >> practice the different ways of entering data in glm(). >> >> In his book, Dalgaard provides data from a case-based study >> about hypertension summarized in the form of a table. He shows >> two ways of entering the response (dependent) variable data >> in glm(): (1) as a matrix of successes/failures (diseased/ >> healthy); (2) as the proportion of people diseased for each >> combination of independent variable's categories. >> >> I tried to enter the response variable in glm() in a third >> way: by reconstructing, from the table, the original data >> in a case-based format, that is, a data frame in which >> each row shows the data for one person. In this situation, >> the response variable would be coded as a numeric 0/1 vector, >> 0=failure, 1=success, and so it would be entered in glm() as >> a numeric 0/1 vector. >> >> The program below presents the calculations for each of the >> three ways of entering data - the first and second methods >> were just copied from Dalgaard's book. >> >> The three methods produce the same results with regard >> to the estimated coefficients, when the output is seen >> with five decimals (although regression 3 seems to have >> produced slightly different std.errors). >> >> My main question is: Why are the residual deviance, its >> degrees of freedom and the AIC produced by regression 3 >> completely different when compared to those produced by >> regressions 1 and 2 (which seem to produce equal results)? >> It seems that the degrees of freedom in regressions 1 >> and 2 are based on the size (number of rows) of table d >> (see the output of the program below), but this table is >> just a way of summarizing the data. The degrees of >> freedom in regressions 1 and 2 are not based on the >> actual number of cases (people) examined, which is n=433. > > I first encountered this phenomenon 25 years ago when using GLIM. The answer > from my statistical betters was that the deviance is actually only > established up to a constant and that it is only differences in deviance that > can be properly interpreted. The same situation exists with indefinite > integrals in calculus. >> >> I understand that no matter the way of entering the data >> in glm(), we are always analyzing the same data, which >> are those presented in table format on Dalgaard's page >> 229 (these data are in data.frame d in the program below). >> So I understand that the three ways of entering data >> in glm() should produce the same results. > > The differences between equivalent nested models should remain the same (up > to numerical accuracy). > > 411.42 on 432 degrees of freedom > -398.92 on 429 > ----------------- > 12.5 3 > > 14.1259 on 7 degrees of freedom > -1.6184 on 4 > -------------- > 12.5075 3 > >> >> Secondarily, why are the std.errors in regression 3 slightly >> different from those calculated in regressions 1 and 2? > > You mean the differences 4 places to the right of the decimal??? > >> >> I am using R version 2.11.1 on Windows XP. >> >> Thank you very much. >> >> Paulo Barata >> >> ##== begin ================================================= >> >> ## data in: P. Dalgaard, Introductory Statistics with R, >> ## 2nd. edition, Springer, 2008 >> ## logistic regression - example in Dalgaard's Section 13.2, >> ## page 229 >> >> rm(list=ls()) > > Personally, I rather annoyed when people post this particular line without > commenting it out. It is basically saying that your code is terribly much > more important than whatever else might be in a user's workspace. > >> >> ## data provided on Dalgaard's page 229: >> no.yes <- c("No","Yes") >> smoking <- gl(2,1,8,no.yes) >> obesity <- gl(2,2,8,no.yes) >> snoring <- gl(2,4,8,no.yes) >> n.tot <- c(60,17,8,2,187,85,51,23) >> n.hyp <- c(5,2,1,0,35,13,15,8) >> >> d <- data.frame(smoking,obesity,snoring,n.tot,n.hyp) >> ## d is the data to be analyzed, in table format >> ## d is the first table on Dalgaard's page 229 >> ## n.tot = total number of cases >> ## n.hyp = people with hypertension >> d >> >> ## regression 1: Dalgaard's page 230 >> ## response variable entered in glm() as a matrix of >> ## successes/failures >> hyp.tbl <- cbind(n.hyp,n.tot-n.hyp) >> regression1 <- glm(hyp.tbl~smoking+obesity+snoring, >> family=binomial("logit")) >> >> ## regression 2: Dalgaard's page 230 >> ## response variable entered in glm() as proportions >> prop.hyp <- n.hyp/n.tot >> regression2 <- glm(prop.hyp~smoking+obesity+snoring, >> weights=n.tot,family=binomial("logit")) >> >> ## regression 3 (well below): data entered in glm() >> ## by means of 'reconstructed' variables >> ## variables with names beginning with 'r' are >> ## 'reconstructed' from data in data.frame d. >> ## The objective is to reconstruct the original >> ## data from which the table on Dalgaard's page 229 >> ## has been produced >> >> rsmoking <- c(rep('No',d[1,4]),rep('Yes',d[2,4]), >> rep('No',d[3,4]),rep('Yes',d[4,4]), >> rep('No',d[5,4]),rep('Yes',d[6,4]), >> rep('No',d[7,4]),rep('Yes',d[8,4])) >> rsmoking <- factor(rsmoking) >> length(rsmoking) # just a check >> >> robesity <- c(rep('No', d[1,4]),rep('No', d[2,4]), >> rep('Yes',d[3,4]),rep('Yes',d[4,4]), >> rep('No', d[5,4]),rep('No', d[6,4]), >> rep('Yes',d[7,4]),rep('Yes',d[8,4])) >> robesity <- factor(robesity) >> length(robesity) # just a check >> >> rsnoring <- c(rep('No', d[1,4]),rep('No', d[2,4]), >> rep('No', d[3,4]),rep('No', d[4,4]), >> rep('Yes',d[5,4]),rep('Yes',d[6,4]), >> rep('Yes',d[7,4]),rep('Yes',d[8,4])) >> rsnoring <- factor(rsnoring) >> length(rsnoring) # just a check >> >> rhyp <- c(rep(1,d[1,5]),rep(0,d[1,4]-d[1,5]), >> rep(1,d[2,5]),rep(0,d[2,4]-d[2,5]), >> rep(1,d[3,5]),rep(0,d[3,4]-d[3,5]), >> rep(1,d[4,5]),rep(0,d[4,4]-d[4,5]), >> rep(1,d[5,5]),rep(0,d[5,4]-d[5,5]), >> rep(1,d[6,5]),rep(0,d[6,4]-d[6,5]), >> rep(1,d[7,5]),rep(0,d[7,4]-d[7,5]), >> rep(1,d[8,5]),rep(0,d[8,4]-d[8,5])) >> length(rhyp) # just a check >> >> sum(n.tot) # just a check >> >> ## data.frame(rsmoking,robesity,rsnoring,rhyp) would provide >> ## the data in a case-based format - each row would present >> ## data for one case (one person), with response variable >> ## coded 0/1 for No/Yes >> ## The five first cases (people in the study): >> data.frame(rsmoking,robesity,rsnoring,rhyp)[1:5,] >> >> ## regression 3: response variable entered in glm() >> ## as a numeric 0/1 vector >> regression3 <- glm(rhyp~rsmoking+robesity+rsnoring, >> family=binomial("logit")) >> >> summary(regression1) >> summary(regression2) >> summary(regression3) >> >> ## note different residual deviance, degrees of freedom >> ## and AIC between regression 3 and regressions 1 and 2. >> >> ##== end =================================================
FWIW, this recent (last month) reply from Doug on a similar query, may be helpful: http://tolstoy.newcastle.edu.au/R/e10/help/10/06/7780.html HTH, Marc Schwartz ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.