Dear Raffaele,

Using your code, with one modification -- setting the seed for R's random 
number generator to make the result reproducible -- I get:

> set.seed(12345)

. . .

> lmMod <- lm(yvar~xvar)
> print(summary(lmMod))

Call:
lm(formula = yvar ~ xvar)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.0293 -0.6732  0.0021  0.6749  4.2883 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.0057713  0.0057529   174.8   <2e-16 ***
xvar        2.0000889  0.0009998  2000.4   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9964 on 29998 degrees of freedom
Multiple R-squared:  0.9926,    Adjusted R-squared:  0.9926 
F-statistic: 4.002e+06 on 1 and 29998 DF,  p-value: < 2.2e-16

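which is what one would expect: the estimates are close to the true intercept 
of 1 and slope of 2, and the residual standard error is close to the error sd 
of 1.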
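
For anyone who wants to repeat the check, here is a minimal sketch. It assumes the seed is set immediately before the simulation (the exact placement in the elided code may differ) and compares the fit with the values used to generate the data. The object names `ci` and `truth` are introduced here for illustration only.

```r
# Sketch: the simulation from the quoted code, with a fixed seed.
# Assumption: the seed is set right before the random draws.
set.seed(12345)
N <- 30000
xvar <- runif(N, -10, 10)
e <- rnorm(N, mean = 0, sd = 1)
yvar <- 1 + 2 * xvar + e
lmMod <- lm(yvar ~ xvar)

# 95% confidence intervals for the intercept and the slope
ci <- confint(lmMod)
print(ci)

# Values used to generate the data (illustrative helper vector)
truth <- c("(Intercept)" = 1, xvar = 2)

# TRUE/FALSE per coefficient: does the interval contain the generating value?
print(ci[, 1] <= truth & truth <= ci[, 2])

# Residual standard error, to compare with the sd of e (which is 1)
print(sigma(lmMod))
```

With 30000 observations the standard errors are tiny, so estimates far from 1 and 2 (such as those in the quoted output) would point to a problem outside the model code itself.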

My guess: you've saved your R workspace from a previous session, and it is then 
loaded at the start of your R session; something in the saved workspace is 
affecting the result, although frankly I can't think what that might be.
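
A rough sketch of how one might look for such leftovers follows. It is only a set of checks, not a diagnosis, and the names `has_rdata` and `hidden` are made up for this example. Note that `rm(list=ls())` removes neither objects whose names start with a dot nor attached packages.

```r
# Sketch: look for things that rm(list = ls()) does not clear.

# 1. Is there a saved workspace in the current working directory?
#    By default R restores ".RData" from the startup directory.
has_rdata <- file.exists(".RData")
print(has_rdata)

# 2. ls() skips names that start with a dot; all.names = TRUE lists them too.
hidden <- setdiff(ls(globalenv(), all.names = TRUE), ls(globalenv()))
print(hidden)

# 3. Attached packages and environments stay attached after rm().
print(search())

# 4. Is lm() still the function from the stats package, or is it masked?
print(environmentName(environment(lm)))
print(identical(lm, stats::lm))
```

Starting R from a terminal with `R --vanilla` skips the saved workspace and the startup files altogether, so rerunning the script there is a quick way to test this guess.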

I hope this helps,

John Fox
Professor Emeritus
McMaster University
Hamilton, Ontario, Canada

> -----Original Message-----
> From: R-help [] On Behalf Of Raffa
> Sent: Saturday, May 25, 2019 8:38 AM
> To:
> Subject: [R] Increasing number of observations worsen the regression model
>
> I have the following code:
> ```
> rm(list=ls())
> N = 30000
> xvar <- runif(N, -10, 10)
> e <- rnorm(N, mean=0, sd=1)
> yvar <- 1 + 2*xvar + e
> plot(xvar,yvar)
> lmMod <- lm(yvar~xvar)
> print(summary(lmMod))
> domain <- seq(min(xvar), max(xvar))    # define a vector of x values to feed into model
> lines(domain, predict(lmMod, newdata = data.frame(xvar=domain)))    # add regression line, using `predict` to generate y-values
> ```
> I expected the coefficients to be something similar to [1,2]. Instead R keeps
> throwing at me random numbers that are not statistically significant and don't
> fit the model, and I have 20k observations. For example
> ```
> Call:
> lm(formula = yvar ~ xvar)
> Residuals:
>      Min      1Q  Median      3Q     Max
> -21.384  -8.908   1.016  10.972  23.663
> Coefficients:
>               Estimate Std. Error t value Pr(>|t|)
> (Intercept) 0.0007145  0.0670316   0.011    0.991
> xvar        0.0168271  0.0116420   1.445    0.148
> Residual standard error: 11.61 on 29998 degrees of freedom
> Multiple R-squared:  7.038e-05,    Adjusted R-squared: 3.705e-05
> F-statistic: 2.112 on 1 and 29998 DF,  p-value: 0.1462
> ```
> The strange thing is that the code works perfectly for N=200 or N=2000.
> It's only for larger N that this happens (for example, N=20000). I have
> tried asking elsewhere, for example on CrossValidated
> <
> observations-worsen-the-regression-model>
> but the code works for them. Any help?
> I am running R 3.6.0 on Kubuntu 19.04
> Best regards
> Raffaele
