Re: [R-sig-phylo] Model Selection and PGLS

Theodore Garland Sun, 04 Jul 2021 10:10:42 -0700

Simone, thanks for sending that!
As a note of caution for users, make sure you compute PREDICTION INTERVALS,
not confidence intervals.  The former are generally much wider.
Cheers,
Ted




On Fri, Jul 2, 2021 at 1:55 AM Simone Blomberg <s.blombe...@uq.edu.au>
wrote:

> Hi all,
>
> Here is some example R code that implements the formula in Ted and
> Tony's 2000 paper. I haven't tested it thoroughly, so caveat emptor. I
> did check that there is no bias due to phylogenetic effects if you use a
> star phylogeny, which gives me some comfort. I haven't bothered with
> computing the prediction intervals, but it is pretty straight forward I
> think.
>
> Best,
> Simone.
>
> library(ape)
> library(nlme)
> library(MASS)
> set.seed(182347)
> tree <- compute.brlen(rtree(50))
> x <- rnorm(50)
> y <- 3*x+mvrnorm(mu=rep(0, 50), Sigma=vcv.phylo(tree)) ## add  physig to y
> y[sample(length(y), 4)] <- NA ## create some missing values in y to predict
> dat <- data.frame(x=x, y=y, species=tree$tip.label)
>
> tree.trimmed <- drop.tip(tree, which(is.na(dat$y))) ## say we first do
> the analysis on the non-missing cases
> dat.trimmed <- dat[complete.cases(dat),]
> fit <- gls(y~x, correlation=corPagel(0.5, phy=tree.trimmed, form=~species),
>             data=dat.trimmed)
> lambda <- as.numeric(fit$modelStruct) ## will assume the same lambda
> applies to missing and nonmissing cases. I'm not sure this is reasonable?
>
> ## lambda
> ## [1] 1.003751
>
> model.matFULL <- model.matrix(~x, data=dat) ## design matrix for model
>
> missvals <- which(is.na(dat$y))
> nmiss <- length(missvals)
> preds <- predict(fit, newdata=dat[missvals,]) ## get predictions from
> model, assuming species are at the root.
> mat <- vcv.phylo(tree)
> C <- 1+(lambda-1)*(1-mat) ## Lambda correction
>
> CCompleteData <- C[-missvals, -missvals] ## remove the missing species
> cihmat <- C[missvals, -missvals] ## get the relationships of the "new"
> species to the previous ones.
>
> XnoMean <- scale(model.matFULL[-missvals,], scale=FALSE) ## centre the
> model matrix
>
> mu <- cihmat %*% solve(CCompleteData) %*%  XnoMean ## formula from Ted
> and Tony's 2000 paper, appendix B, last paragraph.
>
> preds
> ##  [1] -2.440567 -2.370682 -1.035975 -2.583952
>
>   mu
> ##     (Intercept)          x
> ## t26           0 -0.2208723
> ## t27           0  0.5918446
> ## t43           0 -1.1822058
> ## t39           0 -0.8706877
>
> preds+mu[,2] ## "correct" prediction accounting for bias due to
> correlations with other species.
>
>   ##      t26       t27       t43       t39
> ## -2.661439 -1.778837 -2.218181 -3.454639
>
> On 1/7/21 3:09 pm, Theodore Garland wrote:
> > Russell,
> > Please read this paper:
> > https://pubmed.ncbi.nlm.nih.gov/10718731/
> > Cheers
> > Ted
> >
> >
> > On Wed, Jun 30, 2021, 9:21 PM Russell Engelman <neovenatori...@gmail.com
> >
> > wrote:
> >
> >> Dear All,
> >>
> >> What you see is the large uncertainty in “ancestral” states, which is
> part
> >>> of the intercept here. The linear relationship that you overlaid on
> top of
> >>> your data is the relationship predicted at the root of the tree (as if
> such
> >>> a thing existed!). There is a lot of uncertainty about the intercept,
> but
> >>> much less uncertainty in the slope. It looks like the slope is not
> affected
> >>> by the inclusion or exclusion of monotremes. (for one possible
> reference on
> >>> the greater precision in the slope versus the intercept, there’s this:
> >>> http://dx.doi.org/10.1214/13-AOS1105 for the BM).
> >>
> >> Yes, that sounds right from the other data I have. The line approximates
> >> what would be expected for the root of Mammalia, and the signal in the
> PGLS
> >> is more due to shifts in the y-intercept than shifts in slope, which in
> >> turn is supported by the anatomy of the proxy.
> >>
> >> My second cent is that the phylogenetic predictions should be stable.
> The
> >>> uncertainty in the intercept —and the large effect of including
> monotremes
> >>> on the intercept— should not affect predictions, so long as you know
> for
> >>> which species you want to make a prediction. If you want to make
> prediction
> >>> for a species in a small clade “far” from monotremes, say, then the
> >>> prediction is probably quite stable, even if you include monotremes:
> this
> >>> is because the phylogenetic prediction should use the phylogenetic
> >>> relationships for the species to be predicted. A prediction that uses
> the
> >>> linear relationship at the root and ignores the placement of the
> species
> >>> would be the worst-case scenario: for a mammal species with a
> completely
> >>> unknown placement within mammals.
> >>
> >> This is what I'm a bit confused about. I was always told (and it
> seemingly
> >> implies this in some of the PGLS literature I read like Rohlf 2011 and
> >> Smaers and Rohlf 2016) that it isn't possible to include phylogenetic
> data
> >> from the new data points into the prediction in order to improve
> >> predictions. I'm a little confused as to whether it's possible or not
> (see
> >> below).
> >>
> >> There’s probably a number of software that do phylogenetic prediction. I
> >>> know of Rphylopars and PhyloNetworks.
> >>
> >> I will take a look into those.
> >>
> >> I think that Cécile' and Theodore' point is important and too often
> >>> overlooked. Using GLS models, the BLUP (Best Linear Unbiased
> Prediction) is
> >>> not simply obtained from the fitted line but should incorporates
> >>> information from the (evolutionary here) model.
> >>
> >>       There’s a way to impute phylogenetic signal back into a PGLS
> model? I
> >> am super surprised at that. I’ve talked to at least three different
> >> colleagues who use PGLS about this issue, and all of them had told me
> that
> >> there is no way to input phylogenetic signal back into the model for new
> >> data points and I should just go with the single regression line the
> model
> >> gives me (i.e., the regression line for the ancestral node).
> >>
> >>       I tried looking around to see what previous researchers used when
> >> using PCM on body mass (Esteban-Trivigno and Köhler 2011, Campione and
> >> Evans 2012, Yapuncich 2017 thesis) and it looks like all of them just
> went
> >> with the best fit line with the ancestral node, i.e., looking at their
> >> reported results they give a simple trait~predictor equation that does
> not
> >> include phylogeny when calculating new data. Campion and Evans 2012 used
> >> PIC versus PGLS, which I know are technically equivalent but it doesn't
> >> seem like they included phylogenetic information when they predicted new
> >> data: they used their equations on dinosaurs but there are no dinosaurs
> in
> >> the tree they used. I know that it’s possible to incorporate
> phylogenetic
> >> signal into the new data using PVR but PVR has been criticized for other
> >> reasons.
> >>
> >>      This is something that seems really, really concerning because if
> >> there is a method of using phylogenetic covariance to adjust the
> position
> >> of new data points it seems like a lot of workers don’t know these
> methods
> >> exist, to the point that even published papers overlook it. This was
> >> something I was hoping to highlight in a later paper on the data, but it
> >> sounds like people might have discussed it already. I remember talking
> with
> >> my colleagues a lot about "isn't there some way to incorporate
> phylogenetic
> >> information back into the model to improve accuracy of the prediction
> if we
> >> know where the taxon is positioned?" and they just thought there wasn't
> a
> >> way.
> >>
> >> Regarding the model comparison, I would simply avoid it (or limit it) by
> >>> fitting models flexible enough to accommodate between your BM and OLS
> case
> >>> and summarize the results obtained across all the trees…
> >>
> >> I am not entirely sure what is meant here. Do you mean fitting both an
> OLS
> >> and BM model and comparing both models? I am reporting both, but my
> concern
> >> is about which model I report is the best one to use going forward,
> since
> >> the BM model is seemingly less accurate (though I am just taking the
> fitted
> >> values from the PGLS model, which I don't think include phylogenetic
> >> information). The two models I use produce dramatically different
> results,
> >> for example the BM model produces body mass estimates which are 25%
> larger
> >> than OLS.
> >>
> >> Right now PGLS is something I would avoid if I had the option (if for no
> >> other reason than not put all of the analyses in a single, overloaded
> >> manuscript [the manuscript is already about 90 pages] and deviate from
> the
> >> scope of the study), but I'm sure you know that most regression analyses
> >> nowadays require some sort of preliminary PCM to be acceptable.
> >>
> >> Sincerely,
> >> Russell
> >>
> >> On Wed, Jun 30, 2021 at 10:24 AM Julien Clavel <
> julien.cla...@hotmail.fr>
> >> wrote:
> >>
> >>> I think that Cécile' and Theodore' point is important and too often
> >>> overlooked. Using GLS models, the BLUP (Best Linear Unbiased
> Prediction) is
> >>> not simply obtained from the fitted line but should incorporates
> >>> information from the (evolutionary here) model.
> >>>
> >>> For multivariate linear model you can also do it by specifying a tree
> >>> including both the species used to build the model and the ones you
> want to
> >>> predict using the “predict” function in mvMORPH (I think that
> Rphylopars
> >>> can deal with multivariate phylogenetic regression too).
> >>>
> >>> Regarding the model comparison, I would simply avoid it (or limit it)
> by
> >>> fitting models flexible enough to accommodate between your BM and OLS
> case
> >>> and summarize the results obtained across all the trees…
> >>>
> >>> Julien
> >>>
> >>>
> >>> De : R-sig-phylo <r-sig-phylo-boun...@r-project.org> de la part de
> >>> Theodore Garland <theodore.garl...@ucr.edu>
> >>> Envoyé : mercredi 30 juin 2021 03:26
> >>> À : Cecile Ane <cecile....@wisc.edu>
> >>> Cc : mailman, r-sig-phylo <r-sig-phylo@r-project.org>;
> >>> neovenatori...@gmail.com <neovenatori...@gmail.com>
> >>> Objet : Re: [R-sig-phylo] Model Selection and PGLS
> >>>
> >>> All true.  I would just add two things.  First, always graph your data
> and
> >>> do ordinary OLS analyses as a reality check.
> >>>
> >>> Second, I think this is the original paper for phylogenetic prediction:
> >>> Garland, Jr., T., and A. R. Ives. 2000. Using the past to predict the
> >>> present: confidence intervals for regression equations in phylogenetic
> >>> comparative methods. American Naturalist 155:346–364.
> >>> There, we talk about the Equivalency of the Independent-Contrasts and
> >>> Generalized Least Squares Approaches.
> >>>
> >>> Cheers,
> >>> Ted
> >>>
> >>>
> >>> On Tue, Jun 29, 2021 at 5:01 PM Cecile Ane <cecile....@wisc.edu>
> wrote:
> >>>
> >>>> Hi Russel,
> >>>>
> >>>> What you see is the large uncertainty in “ancestral” states, which is
> >>> part
> >>>> of the intercept here. The linear relationship that you overlaid on
> top
> >>> of
> >>>> your data is the relationship predicted at the root of the tree (as if
> >>> such
> >>>> a thing existed!). There is a lot of uncertainty about the intercept,
> >>> but
> >>>> much less uncertainty in the slope. It looks like the slope is not
> >>> affected
> >>>> by the inclusion or exclusion of monotremes. (for one possible
> >>> reference on
> >>>> the greater precision in the slope versus the intercept, there’s this:
> >>>> http://dx.doi.org/10.1214/13-AOS1105 for the BM).
> >>>>
> >>>> My second cent is that the phylogenetic predictions should be stable.
> >>> The
> >>>> uncertainty in the intercept —and the large effect of including
> >>> monotremes
> >>>> on the intercept— should not affect predictions, so long as you know
> for
> >>>> which species you want to make a prediction. If you want to make
> >>> prediction
> >>>> for a species in a small clade “far” from monotremes, say, then the
> >>>> prediction is probably quite stable, even if you include monotremes:
> >>> this
> >>>> is because the phylogenetic prediction should use the phylogenetic
> >>>> relationships for the species to be predicted. A prediction that uses
> >>> the
> >>>> linear relationship at the root and ignores the placement of the
> species
> >>>> would be the worst-case scenario: for a mammal species with a
> completely
> >>>> unknown placement within mammals.
> >>>>
> >>>> There’s probably a number of software that do phylogenetic
> prediction. I
> >>>> know of Rphylopars and PhyloNetworks.
> >>>>
> >>>> my 2 cents…
> >>>> Cecile
> >>>>
> >>>> ---
> >>>> Cécile Ané, Professor (she/her)
> >>>> H. I. Romnes Faculty Fellow
> >>>> Departments of Statistics and of Botany
> >>>> University of Wisconsin - Madison
> >>>> www.stat.wisc.edu/~ane/<http://www.stat.wisc.edu/~ane/>
> >>>>
> >>>> CALS statistical consulting lab:
> >>>> https://calslab.cals.wisc.edu/stat-consulting/
> >>>>
> >>>>
> >>>>
> >>>> On Jun 29, 2021, at 5:37 PM, neovenatori...@gmail.com<mailto:
> >>>> neovenatori...@gmail.com> wrote:
> >>>>
> >>>> Dear All,
> >>>>
> >>>> So this is the main problem I'm facing (see attached figure, which
> >>> should
> >>>> be small enough to post). When I calculate the best-fit line under a
> >>>> Brownian model, this produces a best-fit line that more or less
> bypasses
> >>>> the distribution of the data altogether. I did some testing and found
> >>> that
> >>>> this result was driven solely by the presence of Monotremata,
> resulting
> >>> in
> >>>> the model heavily downweighting all of the phylogenetic variation
> within
> >>>> Theria in favor of the deep divergence between Monotremata and Theria.
> >>>> Excluding Monotremata produces a PGLS fit that's comparable enough to
> >>> the
> >>>> OLS and OU model fit to be justifiable (though I can't just throw out
> >>>> Monotremata for the sake of throwing it out).
> >>>>
> >>>> I am planning to do a more theoretical investigation into the effect
> of
> >>>> Monotremata on the PGLS fit in a future study, but right now what I am
> >>>> trying to do is perform a study in which I use this data to construct
> a
> >>>> regression model that can be used to predict new data. Which is why I
> am
> >>>> trying to use AIC to potentially justify going with OLS or an OU model
> >>> over
> >>>> a Brownian model. From a practical perspective the Brownian model is
> >>> almost
> >>>> unusable because it produces systematically biased estimates with high
> >>>> error rates when applied to new data (error rate is roughly double
> that
> >>> of
> >>>> both the OLS and OU model). This is especially the case because the
> data
> >>>> must be back-transformed into an arithmetic scale to be useable, and
> >>> thus a
> >>>> seemingly minor difference in regression models results in a massive
> >>>> difference in predicted values. However, I need some objective test to
> >>> show
> >>>> that OLS fits the data better than the Brownian model, hence why I was
> >>>> going with AIC. Overall, OLS does seem to outperform the Brownian
> model
> >>> on
> >>>> average, but the variation in AIC is so high it is hard to interpret
> >>> this.
> >>>> This is kind of why I am leery of assuming a null Brownian model. A
> >>>> Brownian model, if anything, does not seem to accurately model the
> >>>> relationship between variables.
> >>>>
> >>>> This is why I am having trouble figuring out how to do model
> selection.
> >>>> Just going with accuracy statistics like percent error or standard
> >>> error of
> >>>> the estimate OLS is better from a purely practical sense (it doesn't
> >>> work
> >>>> for the monotreme taxa, but it turns out that estimate error in the
> >>>> monotremes is only decreased by 10% in a Brownian model when it
> >>>> overestimates mass by nearly 75%, so the improvement really isn't
> worth
> >>> it
> >>>> and using this for monotremes isn't recommended in the first place),
> but
> >>>> the reviewers are expressing skepticism over the fact that the
> Brownian
> >>>> model produces less useable results. And I'm not entirely sure the
> best
> >>> way
> >>>> to go about the PGLS if using one of the birth-death trees isn't
> ideal,
> >>>> perhaps what Dr. Upham says about using the DNA tree might work
> better.
> >>>>
> >>>> Ironically, an OU model might be argued to better fit the data,
> despite
> >>>> the concerns that Dr. Bapst mentioned. Looking at the distribution of
> >>>> signal even though signal is not random, it is more accurately
> >>> described as
> >>>> most taxa hewing to a stable equilibrium with rapid, high magnitude
> >>> shifts
> >>>> at certain evolutionary nodes, rather than the covariation between the
> >>> two
> >>>> traits evolving in a Brownian fashion. I did some experiments with a
> PSR
> >>>> curve and the results seem to favor an OU model or other models with
> >>> uneven
> >>>> rates of evolution rather than a pure Brownian model.
> >>>>
> >>>> Of course, the broader issue I am facing is trying to deal with PGLS
> >>>> succinctly; the scope of the study isn't necessarily an in-depth
> >>> comparison
> >>>> between different regression models, it's more looking at how this
> >>> variable
> >>>> correlates with body mass for practical purposes (for which
> considering
> >>>> phylogeny is one part of that). It's definitely something to consider
> >>> but I
> >>>> am trying to avoid manuscript bloat.
> >>>>
> >>>> Sincerely,
> >>>> Russell
> >>>>
> >>>>
> >>>>          [[alternative HTML version deleted]]
> >>>>
> >>>> _______________________________________________
> >>>> R-sig-phylo mailing list - R-sig-phylo@r-project.org
> >>>> https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
> >>>> Searchable archive at
> >>>> http://www.mail-archive.com/r-sig-phylo@r-project.org/
> >>>>
> >>>          [[alternative HTML version deleted]]
> >>>
> >>> _______________________________________________
> >>> R-sig-phylo mailing list - R-sig-phylo@r-project.org
> >>> https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
> >>> Searchable archive at
> >>> http://www.mail-archive.com/r-sig-phylo@r-project.org/
> >>
> >       [[alternative HTML version deleted]]
> >
> > _______________________________________________
> > R-sig-phylo mailing list - R-sig-phylo@r-project.org
> > https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
> > Searchable archive at
> http://www.mail-archive.com/r-sig-phylo@r-project.org/
>
> --
> Simone Blomberg, BSc (Hons), PhD, MAppStat (she/her)
> Senior Lecturer and Consultant Statistician
> School of Biological Sciences
> The University of Queensland
> St. Lucia Queensland 4072
> Australia
> T: +61 7 3365 2506
> email: S.Blomberg1_at_uq.edu.au
> Twitter: @simoneb66
> UQ ALLY Supporting the diversity of sexuality and gender at UQ.
>
> Policies:
> 1.  I will NOT analyse your data for you.
> 2.  Your deadline is your problem.
>
> Basically, I'm not interested in doing research
> and I never have been. I'm interested in
> understanding, which is quite a different thing.
> And often to understand something you have to
> work it out for yourself because no one else
> has done it. - David Blackwell
>
>

        [[alternative HTML version deleted]]

_______________________________________________
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/

Re: [R-sig-phylo] Model Selection and PGLS

Reply via email to