Bert Gunter Genentech Nonclinical Statistics
> -----Original Message----- > From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] > On Behalf Of Joris Meys > Sent: Monday, June 21, 2010 12:09 PM > To: Carlo Fezzi > Cc: r-help@r-project.org > Subject: Re: [R] {Spam?} Re: mgcv, testing gamm vs lme,which degrees of > freedom? > > Hi Carlo, > > You should get the book of Simon Wood and read it thoroughly. It's all > explained in there, but it will lead me too far to copy it all in a > mail. In short : random effects are part of the error structure of the > model, not of the model itself. They're added to correct the error on > the parameters of the fixed model, and are inherent to the data > structure and not to the hypotheses. Hence, you rarely test their > significance. 1. This discussion probably belongs on the sig-mixed-models list. 2. Your claim is incorrect, I think. The structure of the random errors = model covariance can be parameterized in various ways, and one can try to test significance of nested parameterizations (for a particular fixed effects parameterizaton). Whether you can do so meaningfully especially in the gamm context -- is another issue, but if you check e.g. Bates and Pinheiro, anova for different random effects parameterizations is advocated, and is implemented in the anova.lme() nlme function. 3. But I strongly endorse your suggestion to consult an authoritative resource; I believe it inherently unreasonable (but alas not unusual) that posters somehow expect brief explanations on this list to illuminate and resolve complex statistical issues. -- Bert > > Cheers > Joris > > > On Mon, Jun 21, 2010 at 6:54 PM, Carlo Fezzi <c.fe...@uea.ac.uk> wrote: > > Hi Joris (CC Simon), > > > > Thanks for your kind replies and for being so responsive. > > > > I think this post boils down to two main questions (which I feel are > very > > important for gams modelling): > > > > 1- Is it appropriate to use LR tests in "gamm" to test model reduction? > > 2- If yes, which degrees of freedom should be used? > > > > I do not think we should always use the df from "model$lme". For > example, > > compare the two models (again my first example for the data generation): > > > > f1 <- gamm(y ~ s(x, k=2, bs="cr"), random = list(id=~1), method="ML" ) > > f2 <- gamm(y ~ s(x, k=10, bs="cr"), random = list(id=~1), method="ML" ) > > > > The difference between the two models is in the random effects. Model > "f2" > > has, if I interpreet correctly the output, 7 random effects more than > the > > model "f1", but the fixed effects are the same. So the H0 = "the 7 > random > > effect are not significant". In this case the (app.) likelihood ratio > test > > should have 7 df... is my interpretation correct? > > > > On the other hand, to compare the following models: > > > > f3 <- gamm(y ~ x + I(x^2), random = list(id=~1), method="ML" ) > > f2 <- gamm(y ~ s(x, k=10, bs="cr"), random = list(id=~1), method="ML" ) > > > > Model "f3" has 1 more fixed effect than model "f2", but model "f2" has 7 > > more random effects... again, if I understand correctly the output. In > > this case I don't know if we can do a LR test, the model are not > strictly > > nested I think... > > > > What do you think? > > > > Again many thanks, > > > > Carlo > > > > > > > >> I don't use an LR test for non-nested models, as I fail to formulate a > >> sensible null hypothesis for such tests. Again, everything I write is > >> a personal opinion, and inference in the case of these models is still > >> subject of discussion to date. If you find a plausible way for > >> explaining the result, by all means use the LR test. > >> > >> Personally, I'd go for the AIC / BIC, but these are based on the > >> likelihood themselves. So in the case where the effective complexity > >> of the model appears the same, they're completely equivalent to the > >> likelihood. It's just the inference (i.e. the p-value) I don't trust. > >> But then again, I'm a cautious statistician. If I'm not sure about a > >> method, I'd rather don't use it and go with what I know. In my view, > >> there is not one correct method for a particular problem and/or > >> dataset. Every method makes assumptions and has shortcomings. Only if > >> I know which ones, I can take them into account when interpreting the > >> results. > >> > >> It also depends on the focus as well. If the focus is prediction, you > >> might even want to consider testing whether the variance of the > >> residuals differs significantly with a simple F-test. This indicates > >> whether the predictive power differs significantly between the models. > >> But these tests tend to get very sensitive when you have many > >> datapoints, rendering them practically useless again. > >> > >> So in the end, it always boils down to interpretation. > >> > >> Cheers > >> Joris > >> > >> On Fri, Jun 18, 2010 at 10:29 PM, Carlo Fezzi <c.fe...@uea.ac.uk> > wrote: > >>> Thanks Joris, > >>> > >>> I understand your point regarding the need for the two models to be > >>> nested. So, according to your in the example case the LR test is not > >>> appropriate and the two model should be compared with other criteria > >>> such > >>> as AIC or BIC for example. > >>> > >>> On the other hand, Simon Wood indicated that such a LR test is > >>> (approximately) correct in his previous reply... a am bit confused, > >>> which > >>> is the correct approach to test the two models? Is the LR test correct > >>> only if the parametric model is linear in the x variables maybe? In > this > >>> case, which is the best appraoch to compare a "gamm" vs a "lme" with > >>> quadratic specification? > >>> > >>> Best wishes, > >>> > >>> Carlo > >>> > >>>> Just realized something: You should take into account that the LR > test > >>>> is actually only valid for _nested_ models. Your models are not > >>>> nested. Hence, you shouldn't use the anova function to compare them, > >>>> and you shouldn't compare the df. In fact, if you're interested in > the > >>>> contribution of a term, then using anova to compare the model with > >>>> that term and without that term gives you an answer on the hypothesis > >>>> whether that term with spline contributes significantly to the model. > >>>> > >>>>> f2 <- gamm(y ~ s(x), random = list(id=~1), method="ML") > >>>> > >>>>> f3 <- gamm(y ~ x, random = list(id=~1), method="ML" ) > >>>> > >>>>> f4 <- gamm(y ~ 1, random = list(id=~1), method="ML" ) > >>>> > >>>>> anova(f3$lme,f2$lme) > >>>> Model df AIC BIC logLik Test L.Ratio p-value > >>>> f3$lme 1 4 760 770 -376 > >>>> f2$lme 2 5 381 394 -186 1 vs 2 380 <.0001 > >>>> > >>>>> anova(f4$lme,f2$lme) > >>>> Model df AIC BIC logLik Test L.Ratio p-value > >>>> f4$lme 1 3 945 953 -470 > >>>> f2$lme 2 5 381 394 -186 1 vs 2 568 <.0001 > >>>> > >>>>> anova(f3$lme,f4$lme) > >>>> Model df AIC BIC logLik Test L.Ratio p-value > >>>> f3$lme 1 4 760 770 -376 > >>>> f4$lme 2 3 945 953 -470 1 vs 2 188 <.0001 > >>>> > >>>> This is the correct application of a likelihood ratio test. You see > >>>> that adding the spline increases the df with 1 compared to the linear > >>>> model, as part of the spline gets into the random component. Notice > as > >>>> well that the interpretation of a test in case of a random component > >>>> is not the same as in case of a fixed component. If I understood > >>>> correctly, this LR test specifically says something over the effect > of > >>>> X, without being interested in the shape of the spline. The > >>>> "significance of a spline" is a difficult concept anyway, as a spline > >>>> can be seen as a form of local regression. It's exactly the use of > the > >>>> randomization that allows for a general hypothesis about the added > >>>> value of the spline, without focusing on its actual shape. Hence the > >>>> "freedom" connected to that actual shape should not be used in the df > >>>> used to test the general hypothesis. > >>>> > >>>> Hope this makes sense someway... > >>>> > >>>> Cheers > >>>> Joris > >>>> > >>>> > >>>> On Fri, Jun 18, 2010 at 6:27 PM, Carlo Fezzi <c.fe...@uea.ac.uk> > wrote: > >>>>> Dear Simon, > >>>>> > >>>>> thanks a lot for your prompt reply. > >>>>> > >>>>> Unfortunately I am still confused about which is the correct way to > >>>>> test > >>>>> the two models... as you point out: why in my example the two models > >>>>> have > >>>>> the same degrees of freedom? > >>>>> > >>>>> Intuitively it seems to me the gamm model is more flexible since, as > I > >>>>> understand also from you response, it should contain more random > >>>>> effects > >>>>> than the other model because some of the smooth function parameters > >>>>> are > >>>>> represented as such. This should not be taken into account when > >>>>> testing > >>>>> one model vs the other? > >>>>> > >>>>> Continuing with my example, the two models: > >>>>> > >>>>> f2 <- gamm(y ~ s(x), random = list(id=~1), method="ML") > >>>>> f3 <- gamm(y ~ x + I(x^2), random = list(id=~1), method="ML" ) > >>>>> > >>>>> Can be tested with: > >>>>> > >>>>> anova(f3$lme,f2$lme) > >>>>> > >>>>> But why are the df the same? Model f2 appears to be more flexible > and, > >>>>> as > >>>>> such, should have more (random) parameters. Should not a test of one > >>>>> model > >>>>> vs the other take this into account? > >>>>> > >>>>> Sorry if this may sound dull, many thanks for your help, > >>>>> > >>>>> Carlo > >>>>> > >>>>> > >>>>> > >>>>>> On Wednesday 16 June 2010 20:33, Carlo Fezzi wrote: > >>>>>>> Dear all, > >>>>>>> > >>>>>>> I am using the "mgcv" package by Simon Wood to estimate an > additive > >>>>>>> mixed > >>>>>>> model in which I assume normal distribution for the residuals. I > >>>>>>> would > >>>>>>> like to test this model vs a standard parametric mixed model, such > >>>>>>> as > >>>>>>> the > >>>>>>> ones which are possible to estimate with "lme". > >>>>>>> > >>>>>>> Since the smoothing splines can be written as random effects, is > it > >>>>>>> correct to use an (approximate) likelihood ratio test for this? > >>>>>> -- yes this is ok (subject to the usual caveats about testing on > the > >>>>>> boundary > >>>>>> of the parameter space) but your 2 example models below will have > >>>>>> the > >>>>>> same > >>>>>> number of degrees of freedom! > >>>>>> > >>>>>>> If so, > >>>>>>> which is the correct number of degrees of freedom? > >>>>>> --- The edf from the lme object, if you are testing using the log > >>>>>> likelihood > >>>>>> returned by the lme representation of the model. > >>>>>> > >>>>>>> Sometime the function > >>>>>>> LogLik() seems to provide strange results regarding the number of > >>>>>>> degrees > >>>>>>> of freedom (df) for the gam, for instance in the example I copied > >>>>>>> below > >>>>>>> the df for the "gamm" are equal to the ones for the "lme", but the > >>>>>>> summary(model.gam) seems to indicate a much higher edf for the > gamm. > >>>>>> --- the edf for the lme representation of the model counts only the > >>>>>> fixed > >>>>>> effects + the variance parameters (which includes smoothing > >>>>>> parameters). > >>>>>> Each > >>>>>> smooth typically contributes only one or two fixed effect > parameters, > >>>>>> with > >>>>>> the rest of the coefficients for the smooth treated as random > >>>>>> effects. > >>>>>> > >>>>>> --- the edf for the gam representation of the same model differs in > >>>>>> that > >>>>>> it > >>>>>> also counts the *effective* number of parameters used to represent > >>>>>> each > >>>>>> smooth: this includes contributions from all those coefficients > that > >>>>>> the > >>>>>> lme > >>>>>> representation treated as strictly random. > >>>>>> > >>>>>> best, > >>>>>> Simon > >>>>>> > >>>>>> > >>>>>>> I would be very grateful to anybody who could point out a > solution, > >>>>>>> > >>>>>>> Best wishes, > >>>>>>> > >>>>>>> Carlo > >>>>>>> > >>>>>>> Example below: > >>>>>>> > >>>>>>> ---- > >>>>>>> > >>>>>>> rm(list = ls()) > >>>>>>> library(mgcv) > >>>>>>> library(nlme) > >>>>>>> > >>>>>>> set.seed(123) > >>>>>>> > >>>>>>> x <- runif(100,1,10) # regressor > >>>>>>> b0 <- rep(rnorm(10,mean=1,sd=2),each=10) # random intercept > >>>>>>> id <- rep(1:10, each=10) # identifier > >>>>>>> > >>>>>>> y <- b0 + x - 0.1 * x^3 + rnorm(100,0,1) # dependent variable > >>>>>>> > >>>>>>> f1 <- lme(y ~ x + I(x^2), random = list(id=~1) , method="ML" ) # > >>>>>>> lme > >>>>>>> model > >>>>>>> > >>>>>>> f2 <- gamm(y ~ s(x), random = list(id=~1), method="ML" ) # gamm > >>>>>>> > >>>>>>> ## same number of "df" according to logLik: > >>>>>>> logLik(f1) > >>>>>>> logLik(f2$lme) > >>>>>>> > >>>>>>> ## much higher edf according to summary: > >>>>>>> summary(f2$gam) > >>>>>>> > >>>>>>> ----------- > >>>>>>> > >>>>>>> ______________________________________________ > >>>>>>> R-help@r-project.org mailing list > >>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help > >>>>>>> PLEASE do read the posting guide > >>>>>>> http://www.R-project.org/posting-guide.html and provide commented, > >>>>>>> minimal, > >>>>>>> self-contained, reproducible code. > >>>>>> > >>>>>> -- > >>>>>>> Simon Wood, Mathematical Sciences, University of Bath, Bath, BA2 > 7AY > >>>>>>> UK > >>>>>>> +44 1225 386603 www.maths.bath.ac.uk/~sw283 > >>>>>> > >>>>> > >>>>> ______________________________________________ > >>>>> R-help@r-project.org mailing list > >>>>> https://stat.ethz.ch/mailman/listinfo/r-help > >>>>> PLEASE do read the posting guide > >>>>> http://www.R-project.org/posting-guide.html > >>>>> and provide commented, minimal, self-contained, reproducible code. > >>>>> > >>>> > >>>> > >>>> > >>>> -- > >>>> Joris Meys > >>>> Statistical consultant > >>>> > >>>> Ghent University > >>>> Faculty of Bioscience Engineering > >>>> Department of Applied mathematics, biometrics and process control > >>>> > >>>> tel : +32 9 264 59 87 > >>>> joris.m...@ugent.be > >>>> ------------------------------- > >>>> Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php > >>>> > >>> > >>> > >>> > >> > >> > >> > >> -- > >> Joris Meys > >> Statistical consultant > >> > >> Ghent University > >> Faculty of Bioscience Engineering > >> Department of Applied mathematics, biometrics and process control > >> > >> tel : +32 9 264 59 87 > >> joris.m...@ugent.be > >> ------------------------------- > >> Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php > >> > > > > > > > > > > -- > Joris Meys > Statistical consultant > > Ghent University > Faculty of Bioscience Engineering > Department of Applied mathematics, biometrics and process control > > tel : +32 9 264 59 87 > joris.m...@ugent.be > ------------------------------- > Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php > > ______________________________________________ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting- > guide.html > and provide commented, minimal, self-contained, reproducible code. ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.