Dear all,

I have some questions regarding the validity an implementation of
statistical tests based on linear models and loess. I've searched the
R-help arhives and found several informative threads that related to my
questions, but there are still a few issues I'm not clear about. I'd be
grateful for guidance.

Background and data set:
    I wish to compare the growth and metabolism of 2 strains of bacteria. I
have grown the 2 strains in 3 replicates on 2 different occasions and on
each occasion taken hourly measurements of the population density of
bacteria (represented by "od" (optical density)) and the concentrations of
several chemicals, such as glycerol, in their growth medium (environment).

    The data looks like this:

        > head(df.yx.t)
           uid  cid date  g t        od   glycerol    alanine
        6    6 Ma,1   25 Ma 1 0.1428571 3025750049 7843841013
        7    7 Ma,1   25 Ma 2 0.4542857 3026668036 7902016686
        8    8 Ma,1   25 Ma 3 1.4542857 3086406597 8017589237
        9    9 Ma,1   25 Ma 4 2.5866667 2821935918 6595302338
        10  10 Ma,1   25 Ma 5 3.0933333 2674142017 6144141491
        11  11 Ma,1   25 Ma 6 3.6000000 3026824144 7009788499

        > summary(df.yx.t)
              uid           cid     date        g            t
od            glycerol            alanine
         6      :  1   Ma,1   :10   25:60   Ma   :60   1      :12   Min.
:0.1171   Min.   :7.305e+08   Min.   :9.858e+07
         7      :  1   Ma,2   :10   26:60   RILI2:60   2      :12   1st
Qu.:1.3921   1st Qu.:1.757e+09   1st Qu.:3.199e+09
         8      :  1   Ma,3   :10                      3      :12   Median
:3.8042   Median :2.541e+09   Median :5.664e+09
         9      :  1   Ma,4   :10                      4      :12   Mean
:3.7822   Mean   :2.354e+09   Mean   :5.264e+09
         10     :  1   Ma,5   :10                      5      :12   3rd
Qu.:5.9495   3rd Qu.:3.026e+09   3rd Qu.:7.699e+09
         11     :  1   Ma,6   :10                      6      :12   Max.
:8.7375   Max.   :3.296e+09   Max.   :8.501e+09
         (Other):114   (Other):60
(Other):48

    uid = unique id for an observation
    cid = culture id (a within subjects factor identifying each bacterial
culture)
    date = date of culture; data was collected in 2 batches, with a
balanced split of observations (3 replicates of each genotype in each batch)
    g = genotype (the name of the bactrial strain)
    t = time (1-10 hours)
    od = optical density (a proxy for poulation density of the bacterial)
    glycerol = glycerol "concentration"
    alanine = alanine "concentration"

I have tested whether Genotype affects the growth (od) of the bacteria
using lm() and anova().

    #define the model
    m1<-lm(log10(od)~date*g*t,data=df.yx.t)
    #perform anova
    > anova(m1)
    Analysis of Variance Table

    Response: log10(od)
              Df Sum Sq Mean Sq   F value    Pr(>F)
    date       1  0.006  0.0056    5.3763   0.02297 *
    g          1  0.107  0.1066  103.2951 4.646e-16 ***
    t          9 33.646  3.7385 3620.9069 < 2.2e-16 ***
    date:g     1  0.003  0.0030    2.8731   0.09396 .
    date:t     9  0.015  0.0017    1.6542   0.11417
    g:t        9  0.021  0.0023    2.2329   0.02796 *
    date:g:t   9  0.005  0.0005    0.4950   0.87383
    Residuals 80  0.083  0.0010
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

So far so good. (Although I do wonder whether I should be using a within
subjects predictor variable (e.g. via car::Anova).) However, I also want to
test whether genotype affects the concentrations of metabolites, e.g.
glycerol, in the growth medium at a given population density. I envisage
testing this by one of 2 methods, neither of which I am confident is
correct, despite digging aound in various fora and statistics texts.

Method 1. Loess splines
    I could fit a loess curve to glycerol~od, (the date effect disappears
when plotting against od rather than t).

        m1<-loess(glycerol~od, data=df.yx.t[df.yx.t$g=="Ma",])
        m2<-loess(glycerol~od, data=df.yx.t[df.yx.t$g=="RILI2",])

    However, I'm not confident that I can perform valid statistic tests
based on this.

    I could try to test the whether there is a significant difference in
glycerol concentration at any given od as follows:

        #test for a difference at od=7
        v.pred<-7
        pred1<-predict(m1,newdata=data.frame(od=v.pred),se=T)
        pred2<-predict(m2,newdata=data.frame(od=v.pred),se=T)


 
dfit<-data.frame(g=c("Ma","RILI2"),fit=c(pred1$fit,pred2$fit),se=c(pred1$fit,pred2$fit))
        alpha<-0.05
        dfit$lo<-with(dfit,fit-se*abs(qnorm(alpha/2)))
        dfit$hi<-with(dfit,fit+se*abs(qnorm(alpha/2)))

    Is this legitimate?

    I might also try to test whether the curves are significantly different
using anova(m1,m2), though again I'm not sure of the validity of this
approach since the 2 models are based on different (and different numbers
of) observations.



Method 2. Linear models

            m2<-lm(glycerol~g*od*date,data=df.yx.t)

        The use of a linear model is complicated by 2 considerations:

        1) od is a continuous variable. If my limited understanding of
linear models and ANOVA is correct, then m1 above tests for a linear
relationship between od and the response variable (e.g. glycerol). The
response of glycerol to OD is not linear over the full range of od. Does
this preclude using linear models?
        I'm tempted to avoid this issue by redefining the model as:

            m3<-lm(glycerol/od~g*t*date,data=df.yx.t)

        Thereby enabling me to perform a factorial ANOVA.

        2) There are missing values in the metabolite data, making the data
unbalanced.
        I have read that one can use type II sum of squares in these
circumstance, which I use without further adjusting for the differences in
group sizes:
            car::Anova(m2,type=2)


Thank you for getting this far. Any guidance would be much appreciated!


Peter Davenport

        [[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Reply via email to