Hi Jan and Simon,

If possible, could you attach the diagnostic plots? I would be curious to see them.
Thanks,
Juliet

On Fri, Apr 19, 2013 at 4:39 AM, jholstei <jan.holst...@awi.de> wrote:
> Simon,
>
> that was very instructive - very special thanks to you.
> I had already noticed that the model was bad, but it was not clear to me
> that transforming the predictors to, say, a more centred distribution
> would help here.
> And thanks for pointing out Tweedie; I had noticed that the error
> structure is far from normal and more like gamma or Poisson, but Gamma
> made things worse.
>
> Best regards,
> Jan
>
> On 18 Apr 2013, at 17:25, Simon Wood wrote:
>
> > Jan,
> >
> > Thanks for the data (off list). The p-value computations are based on
> > the approximation that things are approximately normal on the linear
> > predictor scale, but in this case they are actually nowhere close to
> > normal, which is why the p-values look inconsistent. The approximate
> > normality assumption fails because the model is quite a poor fit. If
> > you take a look at gam.check(fit) you'll see that the constant
> > variance assumption of quasi(link=log) is violated quite badly, and
> > the residual distribution is really quite odd (plot residuals against
> > fitted values as well). Also see plot(fit,pages=1,scale=0) - it shows
> > ballooning confidence intervals and smooth estimates that are so low
> > in places that they might as well be minus infinity (given the log
> > link) - clearly something is wrong with this model!
> >
> > I would be inclined to reset all the 0's to 0 (rather than 0.01), and
> > then to try Tweedie(p=1.5,link=log) as the family. The predictor
> > variables are also very skewed, which is giving leverage problems, so
> > I would transform them to reduce the skew, e.g. something like
> >
> > fit <- gam(target ~ s(log(mgs)) + s(I(gsd^.5)) + s(I(mud^.25)) + s(log(ssCmax)),
> >            family = Tweedie(p = 1.6, link = log), data = df, method = "REML")
> >
> > gives a model that is closer to being reasonable (p-values are then
> > consistent between select=TRUE and select=FALSE).
> >
> > best,
> > Simon
> >
> > On 18/04/13 14:24, Simon Wood wrote:
> >> Jan,
> >>
> >> Thanks for this. Is there any chance that you could send me the data
> >> off list and I'll try to figure out what is happening? (On the
> >> understanding that I'll only use the data for investigating this
> >> issue, of course.)
> >>
> >> best,
> >> Simon
> >>
> >> On 18/04/13 11:11, Jan Holstein wrote:
> >>> Simon,
> >>>
> >>> thanks for the reply, I guess I'm pretty much up to date using
> >>> mgcv 1.7-22.
> >>> Upgrading to R 3.0.0 also didn't change anything.
> >>>
> >>> Unfortunately, using method="REML" does not make any difference:
> >>>
> >>> ####### first with "select=FALSE"
> >>>> fit <- gam(target ~ s(mgs) + s(gsd) + s(mud) + s(ssCmax),
> >>>>            family=quasi(link=log), data=wspe1, method="REML", select=FALSE)
> >>>> summary(fit)
> >>>
> >>> Family: quasi
> >>> Link function: log
> >>>
> >>> Formula:
> >>> target ~ s(mgs) + s(gsd) + s(mud) + s(ssCmax)
> >>>
> >>> Parametric coefficients:
> >>>             Estimate Std. Error t value Pr(>|t|)
> >>> (Intercept)   -4.724      7.462  -0.633    0.527
> >>>
> >>> Approximate significance of smooth terms:
> >>>             edf Ref.df      F p-value
> >>> s(mgs)    3.118  3.492  0.099   0.974
> >>> s(gsd)    6.377  7.044 15.596  <2e-16 ***
> >>> s(mud)    8.837  8.971 18.832  <2e-16 ***
> >>> s(ssCmax) 3.886  4.051  2.342   0.052 .
> >>> ---
> >>> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> >>>
> >>> R-sq.(adj) = 0.403   Deviance explained = 40.6%
> >>> REML score = 33186   Scale est. = 8.7812e+05   n = 4511
> >>>
> >>> #### Then using "select=TRUE"
> >>>> fit2 <- gam(target ~ s(mgs) + s(gsd) + s(mud) + s(ssCmax),
> >>>>             family=quasi(link=log), data=wspe1, method="REML", select=TRUE)
> >>>> summary(fit2)
> >>>
> >>> Family: quasi
> >>> Link function: log
> >>>
> >>> Formula:
> >>> target ~ s(mgs) + s(gsd) + s(mud) + s(ssCmax)
> >>>
> >>> Parametric coefficients:
> >>>             Estimate Std. Error t value Pr(>|t|)
> >>> (Intercept)   -6.406      5.239  -1.223    0.222
> >>>
> >>> Approximate significance of smooth terms:
> >>>             edf Ref.df     F p-value
> >>> s(mgs)    2.844      8 25.43  <2e-16 ***
> >>> s(gsd)    6.071      9 14.50  <2e-16 ***
> >>> s(mud)    6.875      8 21.79  <2e-16 ***
> >>> s(ssCmax) 3.787      8 18.42  <2e-16 ***
> >>> ---
> >>> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> >>>
> >>> R-sq.(adj) = 0.4   Deviance explained = 40.1%
> >>> REML score = 33203   Scale est. = 8.8359e+05   n = 4511
> >>>
> >>> I played around with other families/link functions with no success
> >>> regarding the "select" behaviour.
> >>>
> >>> Well, look at the structure of my data:
> >>> <http://r.789695.n4.nabble.com/file/n4664586/screen-capture-1.png>
> >>>
> >>> All possible predictor variables in principle look like this, and
> >>> taken alone, each and every one is significant according to its
> >>> p-value (but not all can be at the same time).
> >>> In theory, the target variable should be a hypersurface in 11-dimensional
> >>> space with lots of noise, but interaction of more than 2 variables
> >>> gets costly (not to think of 11), and often enough (also without
> >>> interactions) the fit does not converge at minimal step size. If it
> >>> does, the results are usually not as good as without interactions.
> >>>
> >>> Any comment/advice on model setup is warmly welcome here.
> >>>
> >>> Since I don't want to try out all possible 2047 combinations of up
> >>> to eleven predictor variables for each target variable, I currently
> >>> see no other way than educated manual guessing.
> >>>
> >>> If you know another way of (semi-)automated model tuning/reduction,
> >>> I would very much appreciate it.
> >>>
> >>> best regards,
> >>> Jan
> >>>
> >>> --
> >>> View this message in context:
> >>> http://r.789695.n4.nabble.com/mgcv-how-select-significant-predictor-vars-when-using-gam-select-TRUE-using-automatic-optimization-tp4664510p4664586.html
> >>> Sent from the R help mailing list archive at Nabble.com.
> >>>
> >>> ______________________________________________
> >>> R-help@r-project.org mailing list
> >>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>> PLEASE do read the posting guide
> >>> http://www.R-project.org/posting-guide.html
> >>> and provide commented, minimal, self-contained, reproducible code.
> >
> > --
> > Simon Wood, Mathematical Science, University of Bath BA2 7AY UK
> > +44 (0)1225 386603 http://people.bath.ac.uk/sw283
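Jan's subset-search problem is exactly what mgcv's shrinkage machinery addresses: with select=TRUE, or with shrinkage bases (bs="ts" or bs="cs"), the smoothing penalty can shrink a whole term to effectively zero, so a single fit containing all candidate predictors performs the selection - no need to try all 2047 subsets. A minimal sketch on simulated data (mgcv's gamSim example 1, in which x3 has no true effect by construction; the seed and sample size here are illustrative, not from the thread):

```r
library(mgcv)

set.seed(2)
dat <- gamSim(1, n = 400, dist = "normal", scale = 2)  # x3 has zero true effect

## Shrinkage smoothers: the extra penalty can remove a term entirely,
## driving its effective degrees of freedom (edf) towards zero.
b <- gam(y ~ s(x0, bs = "ts") + s(x1, bs = "ts") +
             s(x2, bs = "ts") + s(x3, bs = "ts"),
         data = dat, method = "REML")

summary(b)  # s(x3) should show an edf near zero, flagging it as droppable
```

An equivalent route is select=TRUE with ordinary bases, which adds a penalty on each smooth's null space; either way, terms whose edf collapses towards zero are candidates for removal.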
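Simon's suggested refit (de-skewed predictors, Tweedie errors, REML smoothing-parameter selection, then diagnostics) can be sketched end to end. Since the original wspe1 data are not public, this runs on simulated stand-in data: the predictor names mirror the thread, but the simulation itself is an assumption for illustration only.

```r
library(mgcv)

## Hypothetical stand-in for the (non-public) wspe1 data: skewed positive
## predictors and a Tweedie-distributed response with many exact zeros.
set.seed(1)
n      <- 500
mgs    <- rlnorm(n); gsd <- rlnorm(n); mud <- rlnorm(n); ssCmax <- rlnorm(n)
mu     <- exp(0.5 * log(mgs) - 0.3 * sqrt(gsd) + 0.2 * mud^0.25 + 0.1 * log(ssCmax))
target <- rTweedie(mu, p = 1.6, phi = 1)  # mgcv's Tweedie deviate generator
df     <- data.frame(target, mgs, gsd, mud, ssCmax)

## The refit from the thread: transforms reduce predictor skew (and hence
## leverage problems), Tweedie(p, link = "log") handles the zero-inflated,
## non-constant-variance response that quasi(link=log) could not.
fit <- gam(target ~ s(log(mgs)) + s(I(gsd^.5)) + s(I(mud^.25)) + s(log(ssCmax)),
           family = Tweedie(p = 1.6, link = "log"), data = df, method = "REML")

gam.check(fit)                    # residual diagnostics discussed in the thread
plot(fit, pages = 1, scale = 0)   # smooths with confidence bands
```

The diagnostic checks are the point here: gam.check and the residuals-vs-fitted plot are what revealed the quasi(link=log) model's problems in the first place, so any refit should be judged the same way.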