Thank you Bert, I see your point. The problem is that I am not sure if this is a statistical issue - i.e., there is something wrong with my procedure - or, in a manner of speaking, a software issue: e.g. lm is not expected to return the same estimates for different approaches because of the way in which the function estimates the solution, and it should not be used the way I do. Maybe I should have better stressed this. Valerio
On Thu, Oct 6, 2022 at 4:31 PM Bert Gunter <bgunter.4...@gmail.com> wrote: > > You could get lucky here, but strictly speaking, this list is about R > programming and statistical issues are typically off topic Someone might > respond privately, though. > > Cheers, > Bert > > On Thu, Oct 6, 2022 at 4:24 AM Valerio Leone Sciabolazza > <sciabola...@gmail.com> wrote: >> >> Good morning, >> I am trying to use R to estimate a fixed effects model (i.e., a panel >> regression model controlling for unobserved time-invariant >> heterogeneities across agents) using different estimation approaches >> (e.g. replicating xtreg from Stata, see e.g. >> https://www.stata.com/support/faqs/statistics/intercept-in-fixed-effects-model/). >> I have already asked this question on different stacks exchange forums >> and contacted package creators who dealt with this issue before, but I >> wasn't able to obtain an answer to my doubts. >> I hope to have better luck on this list. >> >> Let me introduce the problem, and note that I am using an unbalanced panel. >> >> The easiest way to estimate my fixed effect model is using the function lm. >> >> Example: >> >> # load packages >> library(dplyr) >> # set seed for replication purposes >> set.seed(123) >> # create toy dataset >> x <- rnorm(4000) >> x2 <- rnorm(length(x)) >> id <- factor(sample(500,length(x),replace=TRUE)) >> firm <- data.frame(id = id) %>% >> group_by(id) %>% >> mutate(firm = 1:n()) %>% >> pull(firm) >> id.eff <- rlnorm(nlevels(id)) >> firm.eff <- rexp(length(unique(firm))) >> y <- x + 0.25*x2 + id.eff[id] + firm.eff[firm] + rnorm(length(x)) >> db = data.frame(y = y, x = x, id = id, firm = firm) >> rm <- db %>% group_by(id) %>% summarise(firm = max(firm)) %>% >> filter(firm == 1) %>% pull(id) >> db = db[-which(db$id %in% rm), ] >> # Run regression >> test <- lm(y ~ x + id, data = db) >> >> Another approach is demeaning the variables included into the model >> specification. >> In this way, one can exclude the fixed effects from the model. Of >> course, point estimates will be correct, while standard errors will be >> not (because we are not accounting for the degrees of freedom used in >> the demeaning). >> >> # demean data >> dbm <- as_tibble(db) %>% >> group_by(id) %>% >> mutate(y = y - mean(y), >> x = x - mean(x)) %>% >> ungroup() >> # run regression >> test2 <- lm(y ~ x, data = dbm) >> # compare results >> summary(test)$coefficients[2,1] >> > 0.9753364 >> summary(test2)$coefficients[2,1] >> > 0.9753364 >> >> Another way to do this is to demean the variables and add their grand >> average (I believe that this is what xtreg from Stata does) >> >> # create data >> n = length(unique(db$id)) >> dbh <- dbm %>% >> mutate(yh = y + (sum(db$y)/n), >> xh = x + (sum(db$x)/n)) >> # run regression >> test3 <- lm(yh ~ xh, dbh) >> # compare results >> summary(test)$coefficients[2,1] >> > 0.9753364 >> summary(test2)$coefficients[2,1] >> > 0.9753364 >> summary(test3)$coefficients[2,1] >> > 0.9753364 >> >> As one can see, the three approaches report the same point estimates >> (again, standard errors will be different instead). >> >> When I include an additional set of fixed effects in the model >> specification, the three approaches no longer return the same point >> estimate. However, differences seem to be negligible and they could be >> due to rounding. >> >> db$firm <- as.factor(db$firm) >> dbm$firm <- as.factor(dbm$firm) >> dbh$firm <- as.factor(dbh$firm) >> testB <- lm(y ~ x + id + firm, data = db) >> testB2 <- lm(y ~ x + firm, data = dbm) >> testB3 <- lm(yh ~ xh + firm, data = dbh) >> summary(testB)$coefficients[2,1] >> > 0.9834414 >> summary(testB2)$coefficients[2,1] >> > 0.9833334 >> summary(testB3)$coefficients[2,1] >> > 0.9833334 >> >> A similar behavior occurs if I use a dummy variable rather than a >> continous one. For the only purpose of the example, I show this by >> transforming my target variable x from a continuous to a dummy >> variable. >> >> # create data >> x3 <- ifelse(db$x > 0, 1, 0) >> db <- db %>% mutate(x3 = x3) >> dbm <- dbm %>% >> mutate(x3 = x3) %>% >> group_by(id) %>% >> mutate(x3 = x3 - mean(x3)) %>% >> ungroup() >> dbh <- dbh %>% mutate(x3 = dbm$x3) %>% >> mutate(x3 = x3 + (sum(db$x3)/n)) %>% >> ungroup() >> # Run regressions >> testC <- lm(y ~ x3 + id + firm, data = db) >> testC2 <- lm(y ~ x3 + firm, data = dbm) >> testC3 <- lm(yh ~ x3 + firm, data = dbh) >> summary(testC)$coefficients[2, 1] >> > 1.579883 >> summary(testC2)$coefficients[2, 1] >> > 1.579159 >> summary(testC3)$coefficients[2, 1] >> > 1.579159 >> >> Now, I want to estimate both the impact of x when this is higher than >> 0 (i.e., x3) and when this is lower or equal to zero (call it x4). >> Again, observe that x3 might as well be a real dummy variable, not a >> transformation of a continuous variable. >> >> In order to do that, I exclude the intercept from my model. >> Specifically, I do the following: >> >> # create data >> x4 <- ifelse(db$x <= 0, 1, 0) >> db <- db %>% mutate(x4 = x4) >> dbm <- dbm %>% >> mutate(x4 = x4) %>% >> group_by(id) %>% >> mutate(x4 = x4 - mean(x4)) %>% >> ungroup() >> dbh <- dbh %>% mutate(x4 = dbm$x4) %>% >> mutate(x4 = x4 + (sum(db$x4)/n)) %>% >> ungroup() >> testD <- lm(y ~ x3 + x4 + id + firm - 1, data = db) >> testD2 <- lm(y ~ x3 + x4 + firm - 1, data = dbm) >> testD3 <- lm(yh ~ x3 + x4 + firm - 1, data = dbh) >> summary(testD)$coefficients[1:2, ] >> > 1.2372816 -0.3426011 >> summary(testD2) >> > Call: >> lm(formula = y ~ x3 + x4 + firm - 1, data = dbm) >> >> Residuals: >> Min 1Q Median 3Q Max >> -3.8794 -0.7497 0.0010 0.7442 3.8486 >> >> Coefficients: (1 not defined because of singularities) >> Estimate Std. Error t value Pr(>|t|) >> x3 1.57916 0.03779 41.788 < 2e-16 *** >> x4 NA NA NA NA >> ... redacted >> summary(testD3)$coefficients[1:2] >> > 3.254654 1.675495 >> >> As you can see, the second approach is not able to estimate the impact >> of x4 on y. At the same time, the first and the third approach return >> very different point estimates. >> >> Is anyone able to explain me why I cannot obtain the same point >> estimates for this last exercise? >> >> Is there anything wrong in the way I include the second set of fixed effects? >> Is there anything wrong in the way I include the variables x3 and x4? >> Or this is simply a problem due to some internal functions in R? >> >> Any hint would be much appreciated. >> >> Best, >> Valerio >> >> ______________________________________________ >> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. ______________________________________________ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.