Jean Bréfort wrote:
> One other totally unrelated thing. We got recently a bug report about an
> incorrect R squared in gnumeric regression code
> (http://bugzilla.gnome.org/show_bug.cgi?id=534659). R (version 2.7.0)
> gives the same result as Gnumeric, as can be seen below:
>
>> mydata <- read.csv(file="data.csv",sep=",")
>> mydata
>
>   X  Y
> 1 1  2
> 2 2  4
> 3 3  5
> 4 4  8
> 5 5  0
> 6 6  7
> 7 7  8
> 8 8  9
> 9 9 10
>
>> summary(lm(mydata$Y~mydata$X))
>
> Call:
> lm(formula = mydata$Y ~ mydata$X)
>
> Residuals:
>     Min      1Q  Median      3Q     Max
> -5.8889  0.2444  0.5111  0.7111  2.9778
>
> Coefficients:
>             Estimate Std. Error t value Pr(>|t|)
> (Intercept)   1.5556     1.8587   0.837   0.4303
> mydata$X      0.8667     0.3303   2.624   0.0342 *
> ---
> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
> Residual standard error: 2.559 on 7 degrees of freedom
> Multiple R-squared: 0.4958,     Adjusted R-squared: 0.4238
> F-statistic: 6.885 on 1 and 7 DF,  p-value: 0.03422
>
>> summary(lm(mydata$Y~mydata$X-1))
>
> Call:
> lm(formula = mydata$Y ~ mydata$X - 1)
>
> Residuals:
>     Min      1Q  Median      3Q     Max
> -5.5614  0.1018  0.3263  1.6632  3.5509
>
> Coefficients:
>          Estimate Std. Error t value Pr(>|t|)
> mydata$X   1.1123     0.1487   7.481 7.06e-05 ***
> ---
> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
> Residual standard error: 2.51 on 8 degrees of freedom
> Multiple R-squared: 0.8749,     Adjusted R-squared: 0.8593
> F-statistic: 55.96 on 1 and 8 DF,  p-value: 7.056e-05
>
> I am unable to figure out what this 0.8749 value might represent. If it
> is intended to be the Pearson moment, it should be 0.4958, and if it is
> the coefficient of determination, I think the correct value would be
> 0.4454 as given by Excel. It's of course nice to have the same result in
> R and Gnumeric, but it would be better if this result was accurate (if it
> is, we need some documentation fix). Btw, I am not a statistics expert
> at all.

This horse has been flogged multiple times on the list.
It is of course mainly a matter of convention, but the convention used by R
has been around at least since Genstat in the mid-1970s. In the no-intercept
case, you get the _uncentered_ version of R-squared; that is, the proportion
of the sum of squares explained by the model (as opposed to the sum of squares
of _deviations_ in the usual case). The rationale is that R^2 should be based
on the reduction in residual variation between two nested models, and if there
is no intercept, the only well-determined nested model is the one in which
mydata$Y has mean zero for all x, corresponding to all regression coefficients
being zero. The resulting R^2 is directly related to the F statistic, which
you'll see is also larger and more significant when the intercept is removed.

BTW: lm(mydata$Y~mydata$X) is bad practice; use lm(Y~X, data=mydata). Use of
predict() will demonstrate why.

-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - ([EMAIL PROTECTED])                  FAX: (+45) 35327907
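
A minimal R sketch of the two R-squared conventions, assuming the quoted data
are re-entered by hand into a data frame called mydata (the values then match
the summaries quoted above):

## data from the quoted example, typed in directly
mydata <- data.frame(X = 1:9, Y = c(2, 4, 5, 8, 0, 7, 8, 9, 10))

fit1 <- lm(Y ~ X, data = mydata)      # with intercept
fit0 <- lm(Y ~ X - 1, data = mydata)  # without intercept

## With an intercept, R^2 compares the fit against the mean-only model,
## i.e. it uses sums of squares of deviations from mean(Y):
1 - sum(residuals(fit1)^2) / sum((mydata$Y - mean(mydata$Y))^2)
## about 0.4958, as reported by summary(fit1)

## Without an intercept, the only nested null model is "mean zero", so the
## uncentered (raw) sum of squares of Y is used instead:
r2 <- 1 - sum(residuals(fit0)^2) / sum(mydata$Y^2)
r2
## about 0.8749, as reported by summary(fit0)

## The uncentered R^2 reproduces the quoted F statistic on 1 and 8 df:
(r2 / 1) / ((1 - r2) / 8)
## about 55.96

And a sketch of the predict() point, again assuming the same mydata; the
formula-plus-data fit accepts new X values, while the mydata$X form cannot:

good <- lm(Y ~ X, data = mydata)
predict(good, newdata = data.frame(X = 10:12))
## three predictions, at X = 10, 11, 12

bad <- lm(mydata$Y ~ mydata$X)
predict(bad, newdata = data.frame(X = 10:12))
## 'newdata' has no variable named 'mydata$X', so the original mydata$X is
## found in the calling environment instead and the nine original fitted
## values come back (recent R versions at least warn about the row mismatch)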