Re: [R-sig-phylo] Model Selection and PGLS

Russell Engelman Sat, 03 Jul 2021 00:12:16 -0700

>
> C is the full variance-covariance matrix, made from the tree that includes
> both the known and unknown species. CCompletedata is the
> variance-covariance matrix WITHOUT the unknown species. This corresponds to
> C in Ted and Tony's paper.  cihmat is the matrix of relationships of the
> unknown species to the known species, one row for each unknown species. So
> you cannot replace cihmat with C.



So how would one go about double-checking the original data to see how well
the fitted values for the PGLS model predict the known original values if
cihmat is not replaceable with C? What would be really nice is if there was
some way to return the values for all the species, both new species and old
ones, and then create a separate column using the call
"ifelse(is.na(dat$y)==TRUE,TRUE,FALSE)"
that could be used to call either the old data or the new ones.

I've been trying the code but there still seems to be some unusual results
when I apply it. I made a dummy version of my dataset with a random set of
10 taxa being recorded as NA for the dependent variable, but adding the
corrections for signal still doesn't seem to be working. Adding the
corrections for signal produces identical error rates, and in general OLS
seems to produce more accurate estimates, which is weird given how large
lambda is. See code below along with attached data (though I reran this
with multiple seeds to see if I got the same results).

library(ape);library(nlme);library(caper);library(tidyverse)
set.seed(20210703)

dat<-read.csv("data.csv",header=T)
row.names(dat)<-dat$species
tree<-read.tree("tree.tre")

fit.ols<-lm(log(bm)~I(log(ocw)^(2/3)),data=dat)

dat$species2=dat$species
dat.comp<-comparative.data(tree,dat,names.col=species2)
dat.comp$data$bm[sample(length(dat.comp$data$bm), 10)] <- NA

dat.comp$data[is.na(dat.comp$data$bm)==TRUE,]

tree.trimmed <- drop.tip(tree, which(is.na(dat.comp$data$bm)))
dat.comp$data.trimmed <- dat.comp$data[complete.cases(dat.comp$data$bm),]
gls(log(bm)~I(log(ocw)^(2/3)),
           correlation=corPagel(0.5, phy=tree.trimmed, form=~species),
           
data=dat%>%filter(species!=dat.comp$data[is.na(dat.comp$data$bm)==TRUE,]$species))
#For some reason tree.trimmed does not work even though the two groups
appear to have the same tips.
gls(log(bm)~I(log(ocw)^(2/3)),
           correlation=corPagel(0.5, phy=tree, form=~species),
           data=dat)
fit <- gls(log(bm)~I(log(ocw)^(2/3)), correlation=corPagel(0.5, phy=tree,
form=~species),
           data=dat.comp$data.trimmed)
(lambda <- as.numeric(fit$modelStruct))

model.matFULL <- model.matrix(~I(log(ocw)^(2/3)), data=dat.comp$data) ##
design matrix for model

missvals <- which(is.na(dat.comp$data$bm))
nmiss <- length(missvals)
preds <- predict(fit, newdata=dat.comp$data[missvals,]) ## get predictions
from model, assuming species are at the root.
mat <- vcv.phylo(tree)
C <- (lambda*mat)+(1-lambda)*diag(diag(mat)) ## Lambda correction

CCompleteData <- C[-missvals, -missvals] ## remove the missing species
cihmat <- C[missvals, -missvals] ## get the relationships of the "new"
species to the previous ones.

XnoMean <- scale(model.matFULL[-missvals,], scale=FALSE)

(mu <- cihmat %*% solve(CCompleteData) %*%  XnoMean)

preds+mu[,2]
(data.check<-data.frame(actual.name=dat.comp$data$species[missvals],
      actual=c(dat$bm[dat$species==row.names(mu)[1]],
               dat$bm[dat$species==row.names(mu)[2]],
               dat$bm[dat$species==row.names(mu)[3]],
               dat$bm[dat$species==row.names(mu)[4]],
               dat$bm[dat$species==row.names(mu)[5]],
               dat$bm[dat$species==row.names(mu)[6]],
               dat$bm[dat$species==row.names(mu)[7]],
               dat$bm[dat$species==row.names(mu)[8]],
               dat$bm[dat$species==row.names(mu)[9]],
               dat$bm[dat$species==row.names(mu)[10]]),
      predicted.ols=c(exp(fit.ols$fitted.values[row.names(mu)[1]]),
                      exp(fit.ols$fitted.values[row.names(mu)[2]]),
                      exp(fit.ols$fitted.values[row.names(mu)[3]]),
                      exp(fit.ols$fitted.values[row.names(mu)[4]]),
                      exp(fit.ols$fitted.values[row.names(mu)[5]]),
                      exp(fit.ols$fitted.values[row.names(mu)[6]]),
                      exp(fit.ols$fitted.values[row.names(mu)[7]]),
                      exp(fit.ols$fitted.values[row.names(mu)[8]]),
                      exp(fit.ols$fitted.values[row.names(mu)[9]]),
                      exp(fit.ols$fitted.values[row.names(mu)[10]])),
      predicted.names=dat.comp$data[missvals,]$species,
      predicted.orig=exp(preds),mu.names=row.names(mu),
      predicted.corr=exp(preds+mu[,2]))%>%
  mutate(across(where(is.numeric), round, 2)))
mean(abs(data.check$predicted.ols-data.check$actual)/data.check$predicted.ols)
#%PE with corrected OLS
mean(abs(data.check$predicted.orig-data.check$actual)/data.check$predicted.orig)
#%PE with uncorrected PGLS
mean(abs(data.check$predicted.corr-data.check$actual)/data.check$predicted.corr)
#%PE with corrected PGLS

Sincerely,
Russell


On Fri, Jul 2, 2021 at 8:52 PM Simone Blomberg <s.blombe...@uq.edu.au>
wrote:

> I knew there would be a mistake somewhere. Here is the correction:
>
> ## C <- 1+(lambda-1)*(1-mat) ## Lambda correction WRONG!!
> C <- (lambda*mat)+(1-lambda)*diag(diag(mat)) ## correct, I think..
>
> I tested it on a toy example of a tree with 5 species and it works fine..
>
> I hope there are no other mistakes! See other comments below.
>
> Cheers,
>
> Simone.
> On 3/7/21 8:40 am, Russell Engelman wrote:
>
> Dear All,
>
> Okay, so I tried the code that Dr. Blomberg provided, and it doesn't seem
> to work. I rewrote the code slightly in order for it to try and predict the
> body mass of the taxa used to calculate the line, rather than some new
> taxon, because I need to do that in order to calculate prediction error and
> be able to compare it to previous studies. I got a lambda of about 0.9,
> which is approximately the same as what I got previously. So on the surface
> it seems like the method should work.
>
> The problem comes when I try to apply this to the data itself to calculate
> mu. This is the modified code I used if that helps, though I don't think I
> can send the data or tree to make it replicable because the message
> wouldn't go through last time.
>
> tree.trimmed <- drop.tip(trees[[1]], which(is.na(b$bm)))
> b.trimmed <- b[complete.cases(b$bm),]
> fit <- gls(log(bm)~I(log(ocw)^(2/3)), corPagel(0.5, phy=tree.trimmed,
> form=~species),
>            data=b.trimmed)
> (lambda <- as.numeric(fit$modelStruct))
> model.matFULL <- model.matrix(~I(log(ocw)^(2/3)), data=b)
> preds <- predict(fit,newdata=b)
> mat <- vcv.phylo(trees[[1]])
> C <- 1+(lambda-1)*(1-mat) ## Lambda correction
> XnoMean <- scale(model.matFULL, scale=FALSE)
> mu <- C %*% solve(C) %*%  XnoMean ## Replaced CCompletedata and cihmat
> with C, since both seem to be subsets of that data. This is the only place
> I could think things might go wrong.
>
> C is the full variance-covariance matrix, made from the tree that includes
> both the known and unknown species. CCompletedata is the
> variance-covariance matrix WITHOUT the unknown species. This corresponds to
> C in Ted and Tony's paper.  cihmat is the matrix of relationships of the
> unknown species to the known species, one row for each unknown species. So
> you cannot replace cihmat with C.
>
> mu2<-(mu[match(b$species,row.names(mu)),])[,2] ## Have to re-sort mu based
> on the data sheet. Mu is reported in phylogenetic order, whereas
> predict.gls will always report the data in alphabetical order of the row
> names even if a phylogeny is present.
>
> predict.gls reports the predicted values in the order that is present in
> the newdata argument. You shouldn't have to re-order anything in my example
> code. YMMV however if you adapt my code.
>
> exp(preds+mu2) ## "correct" prediction accounting for bias due to
> correlations with other species. Exp() is because data were log-transformed.
>
> So when I check the results of this equation the predicted values are
> actually *less *accurate than the fitted values reported from gls without
> correction. Normally the accuracy is about +/- 64% of the actual value, but
> adding the correction here results in the data being +/- 84% of the actual
> value. What's more, I took a look at the values of mu and they don't seem
> to make sense when compared on the phylogeny. Namely, if mu were being
> calculated right, one would expect monotremes to have a very negative value
> compared to all other mammals given the deep divergence in trait states
> between monotremes and all other taxa in the figure I previously attached.
> But when I check the mu values, I get a value of 0.35 for *Zaglossus*,
> -0.58 for *Tachyglossus*, and 0.59 for *Ornithorhynchus*. Which is not
> what would be expected if mu were modelling a phylogenetic correction
> factor (if nothing else, the range of mu should probably not span zero). So
> it suggests to me I'm not doing something right. I know that PGLS doesn't
> always result in better fits than OLS, but I'm surprised adding signal back
> in made the results worse when the fit was so poor when it was fitted
> through the ancestral node.
>
> Sincerely,
> Russell
>
> --
> Simone Blomberg, BSc (Hons), PhD, MAppStat (she/her)
> Senior Lecturer and Consultant Statistician
> School of Biological Sciences
> The University of Queensland
> St. Lucia Queensland 4072
> Australia
> T: +61 7 3365 2506
> email: S.Blomberg1_at_uq.edu.au
> Twitter: @simoneb66
> UQ ALLY Supporting the diversity of sexuality and gender at UQ.
>
> Policies:
> 1.  I will NOT analyse your data for you.
> 2.  Your deadline is your problem.
>
> Basically, I'm not interested in doing research
> and I never have been. I'm interested in
> understanding, which is quite a different thing.
> And often to understand something you have to
> work it out for yourself because no one else
> has done it. - David Blackwell
>
>

tree.tre
Description: Binary data

data.csv
Description: MS-Excel spreadsheet

_______________________________________________
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/

Re: [R-sig-phylo] Model Selection and PGLS

Reply via email to