[R] How to automate this model selection algorithm?

Andrew Crane-Droesch Mon, 18 Mar 2013 17:08:48 -0700

I've got a complicated semi-parametric model that I'm fitting withmgcv. I start with a model based on theory. Its got lots ofinteraction terms. I want to winnow it down: removing each interactionterm or un-interacted main effect one by one, checking the AIC, andretaining the model that gives me the lowest AIC. I then want to repeatthe procedure on the retained model.


Here is a simple example:


    set.seed(42)
    library(mgcv)
    N=220
    fac = ceiling(runif(N)*2)
    a = rnorm(N); b = rnorm(N); c = rnorm(N); d = runif(N); e = rnorm(N);
    y = a^fac + b*e + d/(e+1)

    m1 = gam(y~     as.factor(fac)
    +    s(a)
    +    s(b)
    +    s(c)
    +    s(d)
    +    te(a,b,c)
    +    te(a,d,by=as.factor(fac))
    )

    m2 = gam(y~     as.factor(fac)
    +    s(a)
    +    s(b)
    +    s(c)
    +    s(d)
    +    te(b,c)
    +    te(a,d,by=as.factor(fac))
    )

    m3 = gam(y~     as.factor(fac)
    +    s(a)
    +    s(b)
    +    s(c)
    +    s(d)
    +    te(a,c)
    +    te(a,d,by=as.factor(fac))
    )

    m4 = gam(y~     as.factor(fac)
    +    s(a)
    +    s(b)
    +    s(c)
    +    s(d)
    +    te(a,b)
    +    te(a,d,by=as.factor(fac))
    )

    m5 = gam(y~     as.factor(fac)
    +    s(a)
    +    s(b)
    +    s(c)
    +    s(d)
    +    te(a,b,c)
    +    te(d,by=as.factor(fac))
    )

    m6 = gam(y~     as.factor(fac)
    +    s(a)
    +    s(b)
    +    s(c)
    +    s(d)
    +    te(a,b,c)
    +    te(a,by=as.factor(fac))
    )

    m7 = gam(y~     as.factor(fac)
    +    s(a)
    +    s(b)
    +    s(c)
    +    s(d)
    +    te(a,b,c)
    +    te(a,d)
    )

    selection = AIC(m1,m2,m3,m4,m5,m6,m7)
    selection

df      AIC

m1 14.53435 1626.611

m2 12.52501 1622.635

m3 12.54566 1622.615

m4 12.52652 1622.633

m5 13.14108 1620.759

m6 10.99684 1621.156

m7 10.98136 1622.229

m5 is the best

    m5 = gam(y~     as.factor(fac)
    +    s(a)
    +    s(b)
    +    s(c)
    +    s(d)
    +    te(a,b,c)
    +    te(d,by=as.factor(fac))
    )

    m5.1 = gam(y~     as.factor(fac)
    +    s(a)
    +    s(b)
    +    s(c)
    +    s(d)
    +    te(b,c)
    +    te(d,by=as.factor(fac))
    )

    m5.2 = gam(y~     as.factor(fac)
    +    s(a)
    +    s(b)
    +    s(c)
    +    s(d)
    +    te(a,c)
    +    te(d,by=as.factor(fac))
    )
    m5.3 = gam(y~     as.factor(fac)
    +    s(a)
    +    s(b)
    +    s(c)
    +    s(d)
    +    te(a,b)
    +    te(d,by=as.factor(fac))
    )

    m5.4 = gam(y~     as.factor(fac)
    +    s(a)
    +    s(b)
    +    s(c)
    +    s(d)
    +    te(a,b,c)
    #+    te(d,by=as.factor(fac))
    )

    selection.2 = AIC(m5,m5.1,m5.2,m5.3,m5.4)
    selection.2

df      AIC

m5   13.363029 1621.183

m5.1  9.671656 1617.641

m5.2  9.730047 1617.549

m5.3  9.706424 1617.569

m5.4  9.857504 1620.028

5.2 is the best

...and so on. I'd next try out each interaction term or un-interactedmain effect in m5.2. Question is how I can automate this? To startwith a model (m1, in my case), and have R give me the best model afterrunning this algorithm to the point where there are no longer any"better" models according to this algorithm?

Right now I can do it manually, but as I add data over time, modelselection may change.

Note the mgcv's "select=TRUE" functionality doesn't quite work for me(at least not directly), because I want to see if single parts ofthree-way interactions can/should be removed.

Thanks in advance for any tips, and apologies for cross-posting (onstack overflow).


Best,
Andrew

______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

[R] How to automate this model selection algorithm?

Reply via email to