Hi Jan and Simon,

If possible, could you attach the diagnostic plots? I would be curious to see them.
Thanks,
Juliet

On Fri, Apr 19, 2013 at 4:39 AM, jholstei <jan.holst...@awi.de> wrote:
> Simon,
>
> that was very instructive - very special thanks to you.
> I had already noticed that the model was bad, but it was not clear to me
> that transforming the predictors to, say, a more centred distribution
> would help here.
> And thanks for pointing out Tweedie; I had noticed that the error
> structure is far from normal and more like gamma or Poisson, but Gamma
> made things worse.
>
> Best regards,
> Jan
>
> On 18 Apr 2013, at 17:25, Simon Wood wrote:
>
> > Jan,
> >
> > Thanks for the data (off list). The p-value computations are based on
> > the approximation that things are approximately normal on the linear
> > predictor scale, but in this case they are actually nowhere close to
> > normal, which is why the p-values look inconsistent. The approximate
> > normality assumption fails because the model is quite a poor fit. If
> > you take a look at gam.check(fit) you'll see that the constant
> > variance assumption of quasi(link=log) is violated quite badly, and
> > the residual distribution is really quite odd (plot residuals against
> > fitted values as well). Also see plot(fit,pages=1,scale=0) - it shows
> > ballooning confidence intervals and smooth estimates that are so low
> > in places that they might as well be minus infinity (given the log
> > link) - clearly something is wrong with this model!
> >
> > I would be inclined to reset all the 0's to 0 (rather than 0.01), and
> > then to try Tweedie(p=1.5,link=log) as the family. The predictor
> > variables are also very skewed, which is giving leverage problems, so
> > I would transform them to reduce the skew, e.g. something like
> >
> > fit <- gam(target ~ s(log(mgs)) + s(I(gsd^.5)) + s(I(mud^.25)) + s(log(ssCmax)),
> >            family = Tweedie(p = 1.6, link = log), data = df, method = "REML")
> >
> > gives a model that is closer to being reasonable (p-values are then
> > consistent between select=TRUE and select=FALSE).
> >
> > best,
> > Simon
> >
> > On 18/04/13 14:24, Simon Wood wrote:
> >> Jan,
> >>
> >> Thanks for this. Is there any chance that you could send me the data
> >> off list and I'll try to figure out what is happening? (On the
> >> understanding that I'll only use the data for investigating this
> >> issue, of course.)
> >>
> >> best,
> >> Simon
> >>
> >> On 18/04/13 11:11, Jan Holstein wrote:
> >>> Simon,
> >>>
> >>> thanks for the reply, I guess I'm pretty much up to date using
> >>> mgcv 1.7-22.
> >>> Upgrading to R 3.0.0 also didn't change anything.
> >>>
> >>> Unfortunately, using method="REML" does not make any difference:
> >>>
> >>> ####### first with "select=FALSE"
> >>>> fit <- gam(target ~ s(mgs) + s(gsd) + s(mud) + s(ssCmax),
> >>>>            family=quasi(link=log), data=wspe1, method="REML", select=FALSE)
> >>>> summary(fit)
> >>>
> >>> Family: quasi
> >>> Link function: log
> >>>
> >>> Formula:
> >>> target ~ s(mgs) + s(gsd) + s(mud) + s(ssCmax)
> >>>
> >>> Parametric coefficients:
> >>>             Estimate Std. Error t value Pr(>|t|)
> >>> (Intercept)   -4.724      7.462  -0.633    0.527
> >>>
> >>> Approximate significance of smooth terms:
> >>>             edf Ref.df      F p-value
> >>> s(mgs)    3.118  3.492  0.099   0.974
> >>> s(gsd)    6.377  7.044 15.596  <2e-16 ***
> >>> s(mud)    8.837  8.971 18.832  <2e-16 ***
> >>> s(ssCmax) 3.886  4.051  2.342   0.052 .
> >>> ---
> >>> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> >>>
> >>> R-sq.(adj) = 0.403   Deviance explained = 40.6%
> >>> REML score = 33186   Scale est. = 8.7812e+05   n = 4511
> >>>
> >>> #### Then using "select=TRUE"
> >>>> fit2 <- gam(target ~ s(mgs) + s(gsd) + s(mud) + s(ssCmax),
> >>>>             family=quasi(link=log), data=wspe1, method="REML", select=TRUE)
> >>>> summary(fit2)
> >>>
> >>> Family: quasi
> >>> Link function: log
> >>>
> >>> Formula:
> >>> target ~ s(mgs) + s(gsd) + s(mud) + s(ssCmax)
> >>>
> >>> Parametric coefficients:
> >>>             Estimate Std. Error t value Pr(>|t|)
> >>> (Intercept)   -6.406      5.239  -1.223    0.222
> >>>
> >>> Approximate significance of smooth terms:
> >>>             edf Ref.df     F p-value
> >>> s(mgs)    2.844      8 25.43  <2e-16 ***
> >>> s(gsd)    6.071      9 14.50  <2e-16 ***
> >>> s(mud)    6.875      8 21.79  <2e-16 ***
> >>> s(ssCmax) 3.787      8 18.42  <2e-16 ***
> >>> ---
> >>> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> >>>
> >>> R-sq.(adj) = 0.4   Deviance explained = 40.1%
> >>> REML score = 33203   Scale est. = 8.8359e+05   n = 4511
> >>>
> >>> I played around with other families/link functions with no success
> >>> regarding the "select" behaviour.
> >>>
> >>> Well, look at the structure of my data:
> >>> <http://r.789695.n4.nabble.com/file/n4664586/screen-capture-1.png>
> >>>
> >>> All possible predictor variables in principle look like this, and
> >>> taken alone, each and every one is significant according to its
> >>> p-value (but not all can be at the same time).
> >>> In theory, the target variable should be a hypersurface in 11-dimensional
> >>> space with lots of noise, but interaction of more than 2 variables
> >>> gets costly (not to think of 11), and often enough (also without
> >>> interactions) the fit does not converge at minimal step size. If it
> >>> does, the results are usually not as good as without interactions.
> >>>
> >>> Any comment/advice on model setup is warmly welcome here.
> >>>
> >>> Since I don't want to try out all possible 2047 combinations of up
> >>> to eleven predictor variables for each target variable, I currently
> >>> see no other way than educated manual guessing.
> >>>
> >>> If you know another way of (semi-)automated model tuning/reduction,
> >>> I would very much appreciate it.
> >>>
> >>> best regards,
> >>> Jan
> >>>
> >>> --
> >>> View this message in context:
> >>> http://r.789695.n4.nabble.com/mgcv-how-select-significant-predictor-vars-when-using-gam-select-TRUE-using-automatic-optimization-tp4664510p4664586.html
> >>> Sent from the R help mailing list archive at Nabble.com.
> >>>
> >>> ______________________________________________
> >>> R-help@r-project.org mailing list
> >>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>> PLEASE do read the posting guide
> >>> http://www.R-project.org/posting-guide.html
> >>> and provide commented, minimal, self-contained, reproducible code.
> >
> > --
> > Simon Wood, Mathematical Science, University of Bath BA2 7AY UK
> > +44 (0)1225 386603 http://people.bath.ac.uk/sw283
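Jan's subset-search problem is exactly what mgcv's shrinkage machinery addresses: with select=TRUE, or with shrinkage bases (bs="ts" or bs="cs"), the smoothing penalty can shrink a whole term to effectively zero, so a single fit containing all candidate predictors performs the selection - no need to try all 2047 subsets. A minimal sketch on simulated data (mgcv's gamSim example 1, in which x3 has no true effect by construction; the seed and sample size here are illustrative, not from the thread):

```r
library(mgcv)

set.seed(2)
dat <- gamSim(1, n = 400, dist = "normal", scale = 2)  # x3 has zero true effect

## Shrinkage smoothers: the extra penalty can remove a term entirely,
## driving its effective degrees of freedom (edf) towards zero.
b <- gam(y ~ s(x0, bs = "ts") + s(x1, bs = "ts") +
             s(x2, bs = "ts") + s(x3, bs = "ts"),
         data = dat, method = "REML")

summary(b)  # s(x3) should show an edf near zero, flagging it as droppable
```

An equivalent route is select=TRUE with ordinary bases, which adds a penalty on each smooth's null space; either way, terms whose edf collapses towards zero are candidates for removal.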
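Simon's suggested refit (de-skewed predictors, Tweedie errors, REML smoothing-parameter selection, then diagnostics) can be sketched end to end. Since the original wspe1 data are not public, this runs on simulated stand-in data: the predictor names mirror the thread, but the simulation itself is an assumption for illustration only.

```r
library(mgcv)

## Hypothetical stand-in for the (non-public) wspe1 data: skewed positive
## predictors and a Tweedie-distributed response with many exact zeros.
set.seed(1)
n      <- 500
mgs    <- rlnorm(n); gsd <- rlnorm(n); mud <- rlnorm(n); ssCmax <- rlnorm(n)
mu     <- exp(0.5 * log(mgs) - 0.3 * sqrt(gsd) + 0.2 * mud^0.25 + 0.1 * log(ssCmax))
target <- rTweedie(mu, p = 1.6, phi = 1)  # mgcv's Tweedie deviate generator
df     <- data.frame(target, mgs, gsd, mud, ssCmax)

## The refit from the thread: transforms reduce predictor skew (and hence
## leverage problems), Tweedie(p, link = "log") handles the zero-inflated,
## non-constant-variance response that quasi(link=log) could not.
fit <- gam(target ~ s(log(mgs)) + s(I(gsd^.5)) + s(I(mud^.25)) + s(log(ssCmax)),
           family = Tweedie(p = 1.6, link = "log"), data = df, method = "REML")

gam.check(fit)                    # residual diagnostics discussed in the thread
plot(fit, pages = 1, scale = 0)   # smooths with confidence bands
```

The diagnostic checks are the point here: gam.check and the residuals-vs-fitted plot are what revealed the quasi(link=log) model's problems in the first place, so any refit should be judged the same way.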