Dear Raffaele, Using your code, with one modification -- setting the seed for R's random number generator to make the result reproducible -- I get:
> set.seed(12345) . . . > lmMod <- lm(yvar~xvar) > print(summary(lmMod)) Call: lm(formula = yvar ~ xvar) Residuals: Min 1Q Median 3Q Max -4.0293 -0.6732 0.0021 0.6749 4.2883 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 1.0057713 0.0057529 174.8 <2e-16 *** xvar 2.0000889 0.0009998 2000.4 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.9964 on 29998 degrees of freedom Multiple R-squared: 0.9926, Adjusted R-squared: 0.9926 F-statistic: 4.002e+06 on 1 and 29998 DF, p-value: < 2.2e-16 which is more or less what one would expect. My guess: you've saved your R workspace from a previous session, and it is then loaded at the start of your R session; something in the saved workspace is affecting the result, although frankly I can't think what that might be. I hope this helps, John ----------------------------------------------------------------- John Fox Professor Emeritus McMaster University Hamilton, Ontario, Canada Web: https://socialsciences.mcmaster.ca/jfox/ > -----Original Message----- > From: R-help [mailto:r-help-boun...@r-project.org] On Behalf Of Raffa > Sent: Saturday, May 25, 2019 8:38 AM > To: r-help@r-project.org > Subject: [R] Increasing number of observations worsen the regression model > > I have the following code: > > ``` > > rm(list=ls()) > N = 30000 > xvar <- runif(N, -10, 10) > e <- rnorm(N, mean=0, sd=1) > yvar <- 1 + 2*xvar + e > plot(xvar,yvar) > lmMod <- lm(yvar~xvar) > print(summary(lmMod)) > domain <- seq(min(xvar), max(xvar)) # define a vector of x values to feed > into model lines(domain, predict(lmMod, newdata = > data.frame(xvar=domain))) # add regression line, using `predict` to > generate > y-values > > ``` > > I expected the coefficients to be something similar to [1,2]. Instead R keeps > throwing at me random numbers that are not statistically significant and don't > fit the model, and I have 20k observations. For example > > ``` > > Call: > lm(formula = yvar ~ xvar) > > Residuals: > Min 1Q Median 3Q Max > -21.384 -8.908 1.016 10.972 23.663 > > Coefficients: > Estimate Std. Error t value Pr(>|t|) > (Intercept) 0.0007145 0.0670316 0.011 0.991 > xvar 0.0168271 0.0116420 1.445 0.148 > > Residual standard error: 11.61 on 29998 degrees of freedom Multiple R- > squared: 7.038e-05, Adjusted R-squared: 3.705e-05 > F-statistic: 2.112 on 1 and 29998 DF, p-value: 0.1462 > > ``` > > > The strange thing is that the code works perfectly for N=200 or N=2000. > It's only for larger N that this thing happen U(for example, N=20000). I have > tried to ask for example in CrossValidated > <https://stats.stackexchange.com/questions/410050/increasing-number-of- > observations-worsen-the-regression-model> > but the code works for them. Any help? > > I am runnign R 3.6.0 on Kubuntu 19.04 > > Best regards > > Raffaele > > > [[alternative HTML version deleted]] > > ______________________________________________ > R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting- > guide.html > and provide commented, minimal, self-contained, reproducible code. ______________________________________________ R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.