Hi Joris (CC Simon),

Thanks for your kind replies and for being so responsive.
I think this post boils down to two main questions (which I feel are very important for GAM modelling):

1- Is it appropriate to use LR tests in "gamm" to test model reduction?
2- If yes, which degrees of freedom should be used?

I do not think we should always use the df from "model$lme". For example, compare the two models (again using my first example for the data generation):

f1 <- gamm(y ~ s(x, k=2, bs="cr"), random = list(id=~1), method="ML" )
f2 <- gamm(y ~ s(x, k=10, bs="cr"), random = list(id=~1), method="ML" )

The difference between the two models is in the random effects. Model "f2" has, if I interpret the output correctly, 7 more random effects than model "f1", but the fixed effects are the same. So H0 = "the 7 random effects are zero". In this case the (approximate) likelihood ratio test should have 7 df... is my interpretation correct?

On the other hand, consider comparing the following models:

f3 <- gamm(y ~ x + I(x^2), random = list(id=~1), method="ML" )
f2 <- gamm(y ~ s(x, k=10, bs="cr"), random = list(id=~1), method="ML" )

Model "f3" has 1 more fixed effect than model "f2", but model "f2" has 7 more random effects... again, if I understand the output correctly. In this case I don't know if we can do a LR test; the models are not strictly nested, I think... What do you think?

Again many thanks,

Carlo

> I don't use an LR test for non-nested models, as I fail to formulate a
> sensible null hypothesis for such tests. Again, everything I write is
> a personal opinion, and inference in the case of these models is still
> a subject of discussion to date. If you find a plausible way of
> explaining the result, by all means use the LR test.
>
> Personally, I'd go for the AIC / BIC, but these are based on the
> likelihood themselves. So in the case where the effective complexity
> of the models appears the same, they're completely equivalent to the
> likelihood. It's just the inference (i.e. the p-value) I don't trust.
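Carlo's first comparison (question 1) can be made concrete with a minimal sketch. Assumptions: the x, y and id objects from the reproducible example further down the thread have been created; the halved p-value line is one commonly suggested rough correction for testing a single variance component on the boundary of the parameter space (the point Simon raises below), not something the thread itself verifies.

```r
## minimal sketch, assuming x, y, id from the reproducible example exist
library(mgcv)

f1 <- gamm(y ~ s(x, k=2,  bs="cr"), random = list(id=~1), method="ML")
f2 <- gamm(y ~ s(x, k=10, bs="cr"), random = list(id=~1), method="ML")

## likelihood ratio statistic from the lme representations
lr <- as.numeric(2 * (logLik(f2$lme) - logLik(f1$lme)))

## naive p-value on 7 df (Carlo's reading: 7 extra random effects)
pchisq(lr, df = 7, lower.tail = FALSE)

## boundary-aware alternative: in the lme representation f2 adds one
## variance (smoothing) parameter, so a 50:50 mixture of chi-squared
## with df=0 and df=1, i.e. half the df=1 p-value, is a common rough
## correction for testing a variance component on the boundary
0.5 * pchisq(lr, df = 1, lower.tail = FALSE)
```

Which df count is appropriate is exactly what the thread is debating, so the two p-values bracket the two readings rather than settle the question.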
> But then again, I'm a cautious statistician. If I'm not sure about a
> method, I'd rather not use it and go with what I know. In my view,
> there is not one correct method for a particular problem and/or
> dataset. Every method makes assumptions and has shortcomings. Only if
> I know which ones can I take them into account when interpreting the
> results.
>
> It also depends on the focus. If the focus is prediction, you
> might even want to consider testing whether the variance of the
> residuals differs significantly with a simple F-test. This indicates
> whether the predictive power differs significantly between the models.
> But these tests tend to get very sensitive when you have many
> datapoints, rendering them practically useless again.
>
> So in the end, it always boils down to interpretation.
>
> Cheers
> Joris
>
> On Fri, Jun 18, 2010 at 10:29 PM, Carlo Fezzi <c.fe...@uea.ac.uk> wrote:
>> Thanks Joris,
>>
>> I understand your point regarding the need for the two models to be
>> nested. So, according to your view, in the example case the LR test is
>> not appropriate and the two models should be compared with other
>> criteria, such as AIC or BIC.
>>
>> On the other hand, Simon Wood indicated that such a LR test is
>> (approximately) correct in his previous reply... I am a bit confused:
>> which is the correct approach to test the two models? Is the LR test
>> correct only if the parametric model is linear in the x variables,
>> maybe? In that case, which is the best approach to compare a "gamm" vs
>> an "lme" with a quadratic specification?
>>
>> Best wishes,
>>
>> Carlo
>>
>>> Just realized something: you should take into account that the LR test
>>> is actually only valid for _nested_ models. Your models are not
>>> nested. Hence, you shouldn't use the anova function to compare them,
>>> and you shouldn't compare the df.
In fact, if you're interested in the
>>> contribution of a term, then using anova to compare the model with
>>> that term and without it gives you an answer to the hypothesis of
>>> whether that term, with its spline, contributes significantly to the
>>> model.
>>>
>>>> f2 <- gamm(y ~ s(x), random = list(id=~1), method="ML")
>>>> f3 <- gamm(y ~ x, random = list(id=~1), method="ML" )
>>>> f4 <- gamm(y ~ 1, random = list(id=~1), method="ML" )
>>>
>>>> anova(f3$lme,f2$lme)
>>>        Model df AIC BIC logLik   Test L.Ratio p-value
>>> f3$lme     1  4 760 770   -376
>>> f2$lme     2  5 381 394   -186 1 vs 2     380  <.0001
>>>
>>>> anova(f4$lme,f2$lme)
>>>        Model df AIC BIC logLik   Test L.Ratio p-value
>>> f4$lme     1  3 945 953   -470
>>> f2$lme     2  5 381 394   -186 1 vs 2     568  <.0001
>>>
>>>> anova(f3$lme,f4$lme)
>>>        Model df AIC BIC logLik   Test L.Ratio p-value
>>> f3$lme     1  4 760 770   -376
>>> f4$lme     2  3 945 953   -470 1 vs 2     188  <.0001
>>>
>>> This is the correct application of a likelihood ratio test. You see
>>> that adding the spline increases the df by 1 compared to the linear
>>> model, as part of the spline goes into the random component. Notice as
>>> well that the interpretation of a test involving a random component is
>>> not the same as for a fixed component. If I understood correctly, this
>>> LR test specifically says something about the effect of x, without
>>> being interested in the shape of the spline. The "significance of a
>>> spline" is a difficult concept anyway, as a spline can be seen as a
>>> form of local regression. It's exactly the use of the randomization
>>> that allows for a general hypothesis about the added value of the
>>> spline, without focusing on its actual shape. Hence the "freedom"
>>> connected to that actual shape should not be counted in the df used to
>>> test the general hypothesis.
>>>
>>> Hope this makes sense somehow...
>>>
>>> Cheers
>>> Joris
>>>
>>> On Fri, Jun 18, 2010 at 6:27 PM, Carlo Fezzi <c.fe...@uea.ac.uk> wrote:
>>>> Dear Simon,
>>>>
>>>> thanks a lot for your prompt reply.
>>>>
>>>> Unfortunately I am still confused about the correct way to test the
>>>> two models... as you point out: why, in my example, do the two models
>>>> have the same degrees of freedom?
>>>>
>>>> Intuitively it seems to me the gamm model is more flexible since, as
>>>> I understand also from your response, it should contain more random
>>>> effects than the other model, because some of the smooth function
>>>> parameters are represented as such. Should this not be taken into
>>>> account when testing one model vs the other?
>>>>
>>>> Continuing with my example, the two models:
>>>>
>>>> f2 <- gamm(y ~ s(x), random = list(id=~1), method="ML")
>>>> f3 <- gamm(y ~ x + I(x^2), random = list(id=~1), method="ML" )
>>>>
>>>> can be tested with:
>>>>
>>>> anova(f3$lme,f2$lme)
>>>>
>>>> But why are the df the same? Model f2 appears to be more flexible
>>>> and, as such, should have more (random) parameters. Should a test of
>>>> one model vs the other not take this into account?
>>>>
>>>> Sorry if this may sound dull; many thanks for your help,
>>>>
>>>> Carlo
>>>>
>>>>> On Wednesday 16 June 2010 20:33, Carlo Fezzi wrote:
>>>>>> Dear all,
>>>>>>
>>>>>> I am using the "mgcv" package by Simon Wood to estimate an additive
>>>>>> mixed model in which I assume a normal distribution for the
>>>>>> residuals. I would like to test this model vs a standard parametric
>>>>>> mixed model, such as the ones which it is possible to estimate with
>>>>>> "lme".
>>>>>>
>>>>>> Since the smoothing splines can be written as random effects, is it
>>>>>> correct to use an (approximate) likelihood ratio test for this?
>>>>> -- yes this is ok (subject to the usual caveats about testing on
>>>>> the boundary of the parameter space), but your 2 example models
>>>>> below will have the same number of degrees of freedom!
>>>>>
>>>>>> If so, which is the correct number of degrees of freedom?
>>>>>
>>>>> --- The edf from the lme object, if you are testing using the log
>>>>> likelihood returned by the lme representation of the model.
>>>>>
>>>>>> Sometimes the function logLik() seems to provide strange results
>>>>>> regarding the number of degrees of freedom (df) for the gam; for
>>>>>> instance, in the example I copied below the df for the "gamm" are
>>>>>> equal to the ones for the "lme", but summary(model.gam) seems to
>>>>>> indicate a much higher edf for the gamm.
>>>>>
>>>>> --- the edf for the lme representation of the model counts only the
>>>>> fixed effects + the variance parameters (which include smoothing
>>>>> parameters). Each smooth typically contributes only one or two fixed
>>>>> effect parameters, with the rest of the coefficients for the smooth
>>>>> treated as random effects.
>>>>>
>>>>> --- the edf for the gam representation of the same model differs in
>>>>> that it also counts the *effective* number of parameters used to
>>>>> represent each smooth: this includes contributions from all those
>>>>> coefficients that the lme representation treated as strictly random.
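The two df notions Simon describes can be inspected directly. A minimal sketch, assuming f2 has been fitted as in the reproducible example below (the component names are the standard ones for "gamm" fits, but this is an illustration, not part of the thread's output):

```r
## df used by logLik / the LR test: fixed effects + variance parameters
## of the lme representation of the model
attr(logLik(f2$lme), "df")

## effective degrees of freedom of the gam representation, which also
## counts the penalized spline coefficients that lme treats as random
sum(f2$gam$edf)

## edf per smooth term, as reported by summary(f2$gam)
summary(f2$gam)$edf
```

The gap between the first and second numbers is exactly the "effective" contribution of the coefficients that the lme representation treats as strictly random.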
>>>>>
>>>>> best,
>>>>> Simon
>>>>>
>>>>>> I would be very grateful to anybody who could point out a solution.
>>>>>>
>>>>>> Best wishes,
>>>>>>
>>>>>> Carlo
>>>>>>
>>>>>> Example below:
>>>>>>
>>>>>> ----
>>>>>>
>>>>>> rm(list = ls())
>>>>>> library(mgcv)
>>>>>> library(nlme)
>>>>>>
>>>>>> set.seed(123)
>>>>>>
>>>>>> x  <- runif(100,1,10)                    # regressor
>>>>>> b0 <- rep(rnorm(10,mean=1,sd=2),each=10) # random intercept
>>>>>> id <- rep(1:10, each=10)                 # identifier
>>>>>>
>>>>>> y <- b0 + x - 0.1 * x^3 + rnorm(100,0,1) # dependent variable
>>>>>>
>>>>>> f1 <- lme(y ~ x + I(x^2), random = list(id=~1), method="ML" ) # lme model
>>>>>>
>>>>>> f2 <- gamm(y ~ s(x), random = list(id=~1), method="ML" ) # gamm
>>>>>>
>>>>>> ## same number of "df" according to logLik:
>>>>>> logLik(f1)
>>>>>> logLik(f2$lme)
>>>>>>
>>>>>> ## much higher edf according to summary:
>>>>>> summary(f2$gam)
>>>>>>
>>>>>> -----------
>>>>>>
>>>>>> ______________________________________________
>>>>>> R-help@r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide
>>>>>> http://www.R-project.org/posting-guide.html
>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>
>>>>> --
>>>>> Simon Wood, Mathematical Sciences, University of Bath, Bath, BA2 7AY UK
>>>>> +44 1225 386603  www.maths.bath.ac.uk/~sw283
>>>
>>> --
>>> Joris Meys
>>> Statistical consultant
>>>
>>> Ghent University
>>> Faculty of Bioscience Engineering
>>> Department of Applied mathematics, biometrics and process control
>>>
>>> tel : +32 9 264 59 87
>>> joris.m...@ugent.be
>>> -------------------------------
>>> Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
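Joris's suggestion for the non-nested comparison (Carlo's f3 vs f2) can be sketched via information criteria rather than a LR p-value. A minimal sketch, assuming the x, y and id objects from the reproducible example above have been created:

```r
## minimal sketch of the non-nested comparison via AIC / BIC,
## assuming x, y, id from the reproducible example exist
library(mgcv)

f2 <- gamm(y ~ s(x, k=10, bs="cr"), random = list(id=~1), method="ML")
f3 <- gamm(y ~ x + I(x^2),          random = list(id=~1), method="ML")

## compare the lme representations; lower is better for both criteria
AIC(f2$lme, f3$lme)
BIC(f2$lme, f3$lme)
```

This sidesteps the nesting problem, though as Joris notes the criteria are still built from the same (approximate) likelihood, so they inherit its caveats; only the formal p-value is avoided.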