Denis Chabot said the following on 2005-09-29 21:55: > OK, I think I understand better but still have two points to clarify. > > The first one is about the number of df. I think those who replied on > this objected to the way I chose df, not the fact that I would run a > model with 7.4 df per se. If I read you correctly, I artificially > reduce my p-value by using the estimated df found in a mgcv gam model > into another model where I fix df. This is fine, I am quite willing not > to run a second model with a fixed df and instead tell my readers that > my model is "marginally significant" with a p-value of 0.03. > > This being said, do you know of guidelines for choosing df? A colleague > told me he does not go above 10% of the number of points. Should I be > concerned when mgcv estimates 7.4 df for 34 points? Note that for this > particular model P < 1e-16, and P is also very small if I fix df to > either 4 or 7. > > My second point is the difference between models fitted by packages gam > and mgcv. Sure, some of you have said "different algorithms". And when
Maybe that wasn't well put. Think of it as different implementations. The packages `mgcv' and `gam' are by no means the only implementations of GAM; see e.g. the `gss' package. > I specify dfs, shouldn't P-values be very similar for the 2 packages? > If not, what does it say of the confidence we can have in the models? Since there's no universally accepted definition of what a GAM is, you shouldn't expect the results to be the same. > I draw your attention to this exampl: I obtained P-values of 0.50 and > 0.03 with packages gam and mgcv respectively: In this case, however, you're trying to compare apples to oranges... > > library(gam) > Loading required package: splines > > data(kyphosis) > > kyp1 <- gam(Kyphosis ~ s(Number, 3), family=binomial, data=kyphosis) > > summary.gam(kyp1) > > Call: gam(formula = Kyphosis ~ s(Number, 3), family = binomial, data = > kyphosis) > Deviance Residuals: > Min 1Q Median 3Q Max > -1.3646 -0.6233 -0.4853 -0.3133 2.0965 > > (Dispersion Parameter for binomial family taken to be 1) > > Null Deviance: 83.2345 on 80 degrees of freedom > Residual Deviance: 71.9973 on 76.9999 degrees of freedom > AIC: 79.9976 > > Number of Local Scoring Iterations: 7 > > DF for Terms and Chi-squares for Nonparametric Effects > > Df Npar Df Npar Chisq P(Chi) > (Intercept) 1 > s(Number, 3) 1 2 1.37149 0.50375 This test concerns only the non-linear part of the term s(Number, 3). In order to simultaneously test both the linear and non-linear part, as mgcv::summary.gam does, you'd > kyp1.1 <- gam(Kyphosis ~ 1, family=binomial, data=kyphosis) > anova(kyp1.1, kyp1, test = "Chisq") Analysis of Deviance Table Model 1: Kyphosis ~ 1 Model 2: Kyphosis ~ s(Number, 3) Resid. Df Resid. Dev Df Deviance P(>|Chi|) 1 80.0000 83.234 2 76.9999 71.997 3.0001 11.237 0.011 HTH, Henric > > detach(package:gam) > > library(mgcv) > This is mgcv 1.3-7 > > kyp2 <- gam(Kyphosis ~ s(Number, k=4, fx=T), family=binomial, > data=kyphosis) > > summary.gam(kyp2) > > Family: binomial > Link function: logit > > Formula: > Kyphosis ~ s(Number, k = 4, fx = T) > > Parametric coefficients: > Estimate Std. Error z value Pr(>|z|) > (Intercept) -1.5504 0.3342 -4.64 3.49e-06 *** > --- > Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 > > Approximate significance of smooth terms: > edf Est.rank Chi.sq p-value > s(Number) 3 3 8.898 0.0307 * > --- > Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 > > R-sq.(adj) = 0.101 Deviance explained = 12.5% > UBRE score = 0.075202 Scale est. = 1 n = 81 > > kyp2$deviance > [1] 72.85893 > > kyp2$null.deviance > [1] 83.23447 > > kyp2$df.null > [1] 80 > > kyp2$df.residual > [1] 77 > > How can we explain this huge difference? > > Denis > > >> Le 05-09-29 à 06:00, [EMAIL PROTECTED] a écrit : >> >> De : Henric Nilsson <[EMAIL PROTECTED]> >> Date : 29 septembre 2005 03:55:19 HAE >> À : [EMAIL PROTECTED] >> Cc : [email protected] >> Objet : Rép : [R] p-level in packages mgcv and gam >> Répondre à : Henric Nilsson <[EMAIL PROTECTED]> >> >> >> Yves Magliulo said the following on 2005-09-28 17:05: >> >>> hi, >>> i'll try to help you, i send a mail about this subject last week... and >>> i did not have any response... >>> I'm using gam from package mgcv. 1) >>> How to interpret the significance of smooth terms is hard for me to >>> understand perfectly : using UBRE, you fix df. p-value are estimated >>> by chi-sq distribution using GCV, the best df are estimated by GAM. >>> (that's what i want) and >>> p-values >> >> >> >> This is not correct. The df are estimated in both cases (i.e. UBRE >> and GCV), but the scale parameter is fixed in the UBRE case. Hence, >> by default UBRE is used for family = binomial or poisson since the >> scale parameter is assumed to be 1. Similarly, GCV is the default for >> family = gaussian since we most often want the scale (usually denoted >> sigma^2) to be estimated. >> >> >>> are estimated by an F distribution But in that case they said "use at >>> your own risk" in ?summary.gam >> >> >> >> The warning applies in both cases. The p-values are conditional on >> the smoothing parameters, and the uncertainty of the smooths is not >> taken into account when computing the p-values. >> >> >>> so you can also look at the chi.sq : but i don't know how to choose a >> >> >> >> No... >> >> >>> criterion like for p-values... for me, chi.sq show the best >>> predictor in >>> a model, but it's hard to reject one with it. >> >> >> >> Which version of mgcv do you use? The confusion probably stems from >> earlier versions of mgcv (< 1.3-5): the summary and anova methods >> used to have a column denoted Chi.sq even when the displayed >> statistic was computed as F. Recent versions of mgcv has >> >> > summary(b) >> >> Family: gaussian >> Link function: identity >> >> Formula: >> y ~ s(x0) + s(x1) + s(x2) + s(x3) >> >> Parametric coefficients: >> Estimate Std. Error t value Pr(>|t|) >> (Intercept) 7.9150 0.1049 75.44 <2e-16 *** >> --- >> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 >> >> Approximate significance of smooth terms: >> edf Est.rank F p-value >> s(x0) 5.173 9.000 3.785 0.000137 *** >> s(x1) 2.357 9.000 34.631 < 2e-16 *** >> s(x2) 8.517 9.000 84.694 < 2e-16 *** >> s(x3) 1.000 1.000 0.444 0.505797 >> --- >> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 >> >> R-sq.(adj) = 0.726 Deviance explained = 73.7% >> GCV score = 4.611 Scale est. = 4.4029 n = 400 >> >> >> If we assume that the scale is known and fixed at 4.4029, we get >> >> > summary(b, dispersion = 4.4029) >> >> Family: gaussian >> Link function: identity >> >> Formula: >> y ~ s(x0) + s(x1) + s(x2) + s(x3) >> >> Parametric coefficients: >> Estimate Std. Error z value Pr(>|z|) >> (Intercept) 7.9150 0.1049 75.44 <2e-16 *** >> --- >> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 >> >> Approximate significance of smooth terms: >> edf Est.rank Chi.sq p-value >> s(x0) 5.173 9.000 34.067 8.7e-05 *** >> s(x1) 2.357 9.000 311.679 < 2e-16 *** >> s(x2) 8.517 9.000 762.255 < 2e-16 *** >> s(x3) 1.000 1.000 0.444 0.505 >> --- >> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 >> >> R-sq.(adj) = 0.726 Deviance explained = 73.7% >> GCV score = 4.611 Scale est. = 4.4029 n = 400 >> >> Note that t/F changed into z/Chi.sq. >> >> >>> so as far as i m concerned, i use GCV methods, and fix a 5% on the null >>> hypothesis (pvalue) to select significant predictor. after, i look >>> at my >>> smooth, and if the parametrization look fine to me, i validate. >>> generaly, for p-values smaller than 0.001, you can be confident. over >>> 0.001, you have to check. 2) >>> for difference between package gam and mgcv, i sent a mail about this >> >> >> >> The underlying algorithms are very different. >> >> >> HTH, >> Henric >> >> >> De : "Liaw, Andy" <[EMAIL PROTECTED]> >> Date : 28 septembre 2005 14:01:25 HAE >> À : 'Denis Chabot' <[EMAIL PROTECTED]>, Peter Dalgaard >> <[EMAIL PROTECTED]> >> Cc : Thomas Lumley <[EMAIL PROTECTED]>, R list <r- >> [EMAIL PROTECTED]> >> Objet : RE: [R] p-level in packages mgcv and gam >> >> Just change the df in what Thomas described to the degree of >> polynomial, and >> everything he said still applies. Any good book on regression that >> covers >> polynomial regression ought to point this out. >> >> Andy >> >> >>> From: Denis Chabot >>> >>> But what about another analogy, that of polynomials? You may not be >>> sure what degree polynomial to use, and you have not decided before >>> analysing your data. You fit different polynomials to your data, >>> checking if added degrees increase r2 sufficiently by doing F-tests. >>> >>> I thought it was the same thing with GAMs. You can fit a >>> model with 4 >>> df, and in some cases it is of interest to see if this is a better >>> fit than a linear fit. But why can't you also check if 7df is better >>> than 4df? And if you used mgcv first and it tells you that 7df is >>> better than 4df, why bother repeating the comparison 7df >>> against 4df, >>> why not just take the p-value for the model with 7df (fixed)? >>> >>> Denis >> >> > > >> De : Peter Dalgaard <[EMAIL PROTECTED]> >> Date : 28 septembre 2005 12:04:58 HAE >> À : Thomas Lumley <[EMAIL PROTECTED]> >> Cc : Denis Chabot <[EMAIL PROTECTED]>, R list <r- >> [EMAIL PROTECTED]> >> Objet : Rép : [R] p-level in packages mgcv and gam >> >> Thomas Lumley <[EMAIL PROTECTED]> writes: >> >>> >>> Bob, on the other hand, chooses the amount of smoothing depending on >>> the data. When a 4 df smooth fits best he ends up with the same model >>> as Alice and the same p-value. When some other df fits best he ends >>> up with a different model and a *smaller* p-value than Alice. >> >> >> This doesn't actually follow, unless the p-value (directly or >> indirectly) found its way into the definition of "best fit". It does >> show the danger, though. >> >> -- >> O__ ---- Peter Dalgaard Øster Farimagsgade 5, Entr.B >> c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K >> (*) \(*) -- University of Copenhagen Denmark Ph: (+45) >> 35327918 >> ~~~~~~~~~~ - ([EMAIL PROTECTED]) FAX: (+45) >> 35327907 >> > >> De : Thomas Lumley <[EMAIL PROTECTED]> >> Date : 26 septembre 2005 12:54:43 HAE >> À : Denis Chabot <[EMAIL PROTECTED]> >> Cc : [email protected] >> Objet : Rép : [R] p-level in packages mgcv and gam >> >> On Mon, 26 Sep 2005, Denis Chabot wrote: >> >> But the mgcv manual warns that p-level for the smooth can be >> underestimated when df are estimated by the model. Most of the time >> my p-levels are so small that even doubling them would not result in >> a value close to the P=0.05 threshold, but I have one case with P=0.033. >> >> I thought, probably naively, that running a second model with fixed >> df, using the value of df found in the first model. I could not >> achieve this with mgcv: its gam function does not seem to accept >> fractional values of df (in my case 8.377). >> >> No, this won't work. The problem is the usual one with model >> selection: the p-value is calculated as if the df had been fixed, >> when really it was estimated. >> >> It is likely to be quite hard to get an honest p-value out of >> something that does adaptive smoothing. >> >> -thomas >> De : Thomas Lumley <[EMAIL PROTECTED]> >> Date : 28 septembre 2005 11:33:27 HAE >> À : Denis Chabot <[EMAIL PROTECTED]> >> Cc : R list <[email protected]> >> Objet : Rép : [R] p-level in packages mgcv and gam >> >> On Wed, 28 Sep 2005, Denis Chabot wrote: >> >> I only got one reply to my message: >> >> No, this won't work. The problem is the usual one with model >> selection: the p-value is calculated as if the df had been fixed, >> when really it was estimated. >> >> It is likely to be quite hard to get an honest p-value out of >> something that does adaptive smoothing. >> >> -thomas >> >> I do not understand this: it seems that a lot of people chose df=4 >> for no particular reason, but p-levels are correct. If instead I >> choose df=8 because a previous model has estimated this to be an >> optimal df, P-levels are no good because df are estimated? >> >> Yes. I know this sounds strange initially, but it really does make >> sense if you think about it carefully. >> >> Suppose that Alice and Bob are kyphosis researchers, and that Alice >> always chooses 4df for smoothing Age. We would all agree that her >> p-values are correct [in fact we wouldn't, but that is a separate issue] >> >> Bob, on the other hand, chooses the amount of smoothing depending on >> the data. When a 4 df smooth fits best he ends up with the same model >> as Alice and the same p-value. When some other df fits best he ends >> up with a different model and a *smaller* p-value than Alice. >> >> In particular, this is still true under the null hypothesis that Age >> has no effect [If Alice and Bob are interested in p-values, the null >> hypothesis must be plausible.] >> >> If Bob's p-values are always less than or equal to Alice's p-values >> under the null hypothesis, and Alice's p-values are less than 0.05 5% >> of the time, then Bob's p-values are less than 0.05 more than 5% of >> the time. >> >> >> -thomas >> >> >> Furthermore, shouldn't packages gam and mgcv give similar results >> when the same data and df are used? I tried this: >> >> library(gam) >> data(kyphosis) >> kyp1 <- gam(Kyphosis ~ s(Age, 4), family=binomial, data=kyphosis) >> kyp2 <- gam(Kyphosis ~ s(Number, 4), family=binomial, data=kyphosis) >> kyp3 <- gam(Kyphosis ~ s(Start, 4), family=binomial, data=kyphosis) >> anova.gam(kyp1) >> anova.gam(kyp2) >> anova.gam(kyp3) >> >> detach(package:gam) >> library(mgcv) >> kyp4 <- gam(Kyphosis ~ s(Age, k=4, fx=T), family=binomial, >> data=kyphosis) >> kyp5 <- gam(Kyphosis ~ s(Number, k=4, fx=T), family=binomial, >> data=kyphosis) >> kyp6 <- gam(Kyphosis ~ s(Start, k=4, fx=T), family=binomial, >> data=kyphosis) >> anova.gam(kyp4) >> anova.gam(kyp5) >> anova.gam(kyp6) >> >> >> P levels for these models, by pair >> >> kyp1 vs kyp4: p= 0.083 and 0.068 respectively (not too bad) >> kyp2 vs kyp5: p= 0.445 and 0.03 (wow!) >> kyp3 vs kyp6: p= 0.053 and 0.008 (wow again) >> >> Also if you plot all these you find that the mgcv plots are smoother >> than the gam plots, even the same df are used all the time. >> >> I am really confused now! >> >> Denis >> > >> >> Thomas Lumley Assoc. Professor, Biostatistics >> [EMAIL PROTECTED] University of Washington, SeattleDe : >> Thomas Lumley <[EMAIL PROTECTED]> >> Date : 28 septembre 2005 14:35:26 HAE >> À : Denis Chabot <[EMAIL PROTECTED]> >> Cc : Peter Dalgaard <[EMAIL PROTECTED]>, R list <r- >> [EMAIL PROTECTED]> >> Objet : Rép : [R] p-level in packages mgcv and gam >> >> On Wed, 28 Sep 2005, Denis Chabot wrote: >> >> But what about another analogy, that of polynomials? You may not be >> sure what degree polynomial to use, and you have not decided before >> analysing your data. You fit different polynomials to your data, >> checking if added degrees increase r2 sufficiently by doing F-tests. >> >> Yes, you can. And this procedure gives you incorrect p-values. >> >> They may not be very incorrect -- it depends on how much model >> selection you do, and how strongly the feature you are selecting on >> is related to the one you are testing. >> >> For example, using step() to choose a polynomial in x even when x is >> unrelated to y and z inflates the Type I error rate by giving a >> biased estimate of the residual mean squared error: >> >> once<-function(){ >> y<-rnorm(50);x<-runif(50);z<-rep(0:1,25) >> summary(step(lm(y~z), >> scope=list(lower=~z,upper=~z+x+I(x^2)+I(x^3)+I(x^4)), >> trace=0))$coef["z",4] >> } >> p<-replicate(1000,once()) >> mean(p<0.05) >> [1] 0.072 >> >> which is significantly higher than you would expect for an honest >> level 0.05 test. >> >> -thomas > > ______________________________________________ [email protected] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
