Hi all, Russell, Nate,

I took an interest in some of the commentary here, so here are my two cents.

> I was surprised that AIC would vary this much in a dataset where the trait 
> data, number of tips, and branching
> topology used to compute the model are more or less constant between trees.

I am not surprised to hear of substantial variation in model support.
In my experience, model support can vary wildly among trees that share
the same taxa and the same overall structure (they'd all more or less
look like the same tree if printed on a conference poster without tip
labels). Most of the information is constrained by where the short
edges are in the tree, especially the short terminal edges that
connect to tips. Varying which terminal edges are very short, or
varying their placement, can produce a lot of variation in which model
is best supported by any information criterion.

All of this applies especially when the OU (single-optimum) model is
among our set of candidate models, because OU acts like a great big
sponge that likes to eat noise; but it also applies when simply
choosing between BM and a signal-less white-noise (OLS) model, or when
evaluating really any test that tries to ascertain whether there is
phylogenetic signal or not.
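
To make that concrete, here is a minimal sketch of the kind of check I
mean, using fitContinuous() from geiger. The objects 'trees' (a
multiPhylo set, e.g. your 100 VertLife trees) and 'trait' (a named
trait vector) are placeholders for your own data, and I'm using the
simple trait-evolution versions of the models rather than the full
regressions:

library(geiger)

## 'trees' is a multiPhylo object; 'trait' is a named numeric vector
## whose names match the tip labels of the trees.
models <- c("BM", "OU", "white")

aic_tab <- t(sapply(trees, function(tr) {
  sapply(models, function(m)
    fitContinuous(tr, trait, model = m)$opt$aicc)
}))

## How often does each model have the lowest AICc across the tree set?
table(models[apply(aic_tab, 1, which.min)])

If the pattern Russell describes holds, the winning model will flip
from tree to tree, even though the trees would all look the same on
that hypothetical conference poster.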

And this all follows naturally: if we are uncertain about the timing
of recent evolutionary divergences, we are uncertain about how rapid
evolution has been recently. If recent change is very rapid, then it
will look like rates are accelerating (à la the OU model) and/or there
is very poor phylogenetic signal (because old divergences in trait
space do not correspond to the larger differences in the dataset).

> ...given that it is logical to expect AIC to vary based on differences 
> between trees, how would one go about
> determining which regression model is the "optimal" one to use for further 
> analysis..?

I think this is a place where an unguided examination of the traits
in question cannot take you any further. Frankly, you need to
interrogate what your expectation is: do you think these traits have
phylogenetic signal? Do they not? Looking at the trees, do you have
any sense of which edges, and which edge lengths, are controlling the
variation you are seeing in model support?
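
One rough way to ask that last question with code (again just a
sketch: 'trees' is the same placeholder set of trees, and 'aic_pgls'
stands in for whatever vector of per-tree AIC values you have already
computed for your focal PGLS model) is to check whether the summed
terminal edge length of each tree tracks its AIC:

library(ape)

## Total length of the terminal (pendant) edges in each tree; these are
## the edges that most strongly constrain estimates of recent rates and
## phylogenetic signal.
terminal_len <- sapply(trees, function(tr) {
  is_terminal <- tr$edge[, 2] <= Ntip(tr)  # edges whose child is a tip
  sum(tr$edge.length[is_terminal])
})

## 'aic_pgls' is assumed to be your vector of per-tree AIC values.
plot(terminal_len, aic_pgls,
     xlab = "summed terminal edge length",
     ylab = "AIC of the focal PGLS model")
cor.test(terminal_len, aic_pgls, method = "spearman")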

I would not state it in the same null-model terms that Nate uses,
because treating the PGLS (BM) model as the null builds in implicit
assumptions about what our defaults should be. I think it would be
better to consider carefully what our expectations for this particular
system are. Do you think descendant populations are generally like
their ancestors over evolutionary time? Do you have some reason for
thinking otherwise?

I would not give the OU model much consideration unless you wish to
use it as a description of a phylogenetic process with decreased
phylogenetic signal (as Carl Boettiger once advocated in this
eight-year-old blog post that I still tell people to read:
https://www.carlboettiger.info/2013/10/11/is-it-time-to-retire-pagels-lambda.html).

Ultimately, though, it sounds like what you will find is a mixed
answer. So why not take a model that allows phylogenetic signal to
scale from non-existent to very strong (like the lambda PGLS model or
the OU PGLS model), fit it to each tree in a large set, and then look
at how the other model parameters vary with the apparent strength of
phylogenetic signal? That may be more informative than choosing a
model based on a wildly varying information criterion, or than getting
a 'yes' or 'no' to whether phylogenetic signal exists in your data.
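
For what it's worth, here is a minimal sketch of that workflow,
assuming a data frame 'dat' with columns y, x, and species (matching
the tip labels) and the same placeholder 'trees' object; it uses gls()
from nlme with corPagel() from ape so that lambda is estimated
separately on each tree:

library(ape)
library(nlme)

## Fit a lambda-PGLS regression on every tree, then collect lambda, the
## slope, and the AIC from each fit.
fits <- lapply(trees, function(tr) {
  gls(y ~ x, data = dat,
      correlation = corPagel(0.5, phy = tr, form = ~species),
      method = "ML")  # 0.5 is just a starting value for lambda
})

lambda <- sapply(fits, function(f)
  coef(f$modelStruct$corStruct, unconstrained = FALSE))
slope  <- sapply(fits, function(f) coef(f)["x"])
aics   <- sapply(fits, AIC)

plot(lambda, slope,
     xlab = "estimated Pagel's lambda", ylab = "slope of y ~ x")

Rather than asking which single model 'wins', you can then look
directly at how the regression changes as the apparent strength of
signal changes across the tree set.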

Cheers,
-Dave

On Tue, Jun 29, 2021 at 2:05 PM Nathan Upham <nathan.up...@asu.edu> wrote:
>
> Hi Russell and all, sounds good.
>
> I’d suggest that the “null model” for fitting trait data to a phylogeny 
> should be single-rate Brownian motion, i.e., you’re assuming that given data 
> on the ancestor-to-descendant relationships of the species (and timing of 
> divergences), and assuming the trait is heritable, it is evolving at the same 
> random rate along each branch.  The burden of proof is on rejecting that null 
> hypothesis (not “accepting it”— sorry for earlier writing that incorrectly!). 
>  So if you do your AIC fitting across the 100 trees, summarize the results, 
> and find no clear signal of a model being obviously better than single-rate 
> Brownian, then that is what you should use for subsequent analyses.
>
> If anyone has a different perspective on this, please chime in.  The above 
> assumes that you’ve established heritability of the trait, for example by 
> doing a test for phylogenetic ’signal’.
>
> Does that help?  All the best
> —nate
>
>
>
> > On Jun 28, 2021, at 1:25 PM, Russell Engelman <neovenatori...@gmail.com> 
> > wrote:
> >
> > Dear Dr. Upham (and All),
> >
> > Please don't take my initial message the wrong way, this is not meant to be 
> > a dig at your 2019 study. I don’t think this is due to the birth-death tree 
> > specifically but would be present in any study where there are multiple 
> > phylogenetic trees to choose from or some measure of uncertainty in the tip 
> > dates. I definitely agree with you that there is almost certainly going to 
> > be variation in model support values if there is any difference in the 
> > underlying phylogeny, however, I was surprised that AIC would vary this 
> > much in a dataset where the trait data, number of tips, and branching 
> > topology used to compute the model are more or less constant between trees.
> >
> > My question is more along the lines of "given that it is logical to expect 
> > AIC to vary based on differences between trees, how would one go about 
> > determining which regression model is the "optimal" one to use for further 
> > analysis"? You mentioned taking the 95% confidence intervals of the models 
> > and seeing if they don't overlap, would this be just taking the singular 
> > AIC from the OLS model and comparing it to the PGLS one, since OLS 
> > seemingly doesn't produce a confidence interval of AIC values? And if the 
> > confidence intervals do overlap, is the OLS  or PGLS considered the null 
> > hypothesis? In my case the AIC for OLS is within the 95% confidence 
> > intervals for the PGLS, but is much lower than the mean value (it's close 
> > to the lower first standard deviation of the AIC values).
> >
> > Sincerely,
> > Russell
> >
> > On Mon, Jun 28, 2021 at 2:46 PM Nathan Upham <nathan.up...@asu.edu> wrote:
> > Hi Russell and all:
> >
> > I’ll respond here since the answer is related to the intended purpose of 
> > the VertLife mammal trees — i.e, capturing full uncertainty in node ages 
> > and phylogenetic relationships was one of the motivators for building the 
> > mammal trees in the way we did.  This approach contrasts to wanting to 
> > obtain the single “best tree”, since methods of phylogenetic reconstruction 
> > will always just be approximations of the “true tree” anyway rather than 
> > ever being equal to that tree.  To only use a single consensus tree in 
> > comparative phylogenetic analyses assumes that we know the true tree, which 
> > again, we don’t ever in an empirical context (only for simulations).  Those 
> > points were summarized well by Huelsenbeck et al. (2000:
> > http://science.sciencemag.org/content/288/5475/2349), but
> > nevertheless are still not standard practice in PCMs.
> >
> > To the point of AIC varying across the 100 trees, this is to be expected.  
> > Any 1 tree of 100 trees from the credible set is not very meaningful; the 
> > entire 100 trees need to be analyzed and then the estimate +/- SE from each 
> > tree can be summarized as a distribution of values.  If the 95% CI on the 
> > distribution of values excludes your hypothesis, then you’ve learned 
> > something; if not, you accept the null hypothesis.  See the animated gifs
> > here (http://vertlife.org/data/mammals/) for a better conception of why
> > this phylogenetic uncertainty is important to consider when doing model
> > fitting or other PCMs.
> >
> > That said, if a single ‘best tree’ is the target, then the DNA-only MCC 
> > tree of 4098 species is a reasonable thing to analyze, more analogous to 
> > how mainstream phylogenetics has presented trees for re-use 
> > (https://github.com/n8upham/MamPhy_v1/blob/master/_DATA/MamPhy_fullPosterior_BDvr_DNAonly_4098sp_topoFree_NDexp_MCC_v2_target.tre).
> > But again, while the MCC tree is appropriate, 1 of 100 trees from the
> > credible set is not.
> >
> > Hope that helps.  All the best,
> > —nate
> >
> >
> >
> > ==============================================================================
> > Nathan S. Upham, Ph.D. (he/him)
> > Assistant Research Professor & Associate Curator of Mammals
> > Arizona State University, School of Life Sciences, Biodiversity Knowledge
> > Integration Center (BioKIC: https://biokic.asu.edu/)
> >      ~> Check out the new Mammal Tree of Life
> >         (http://vertlife.org/data/mammals/) and the Mammal Diversity
> >         Database (https://mammaldiversity.org/)
> >
> > Research Associate, Yale University (Ecology and Evolutionary Biology)
> > Research Associate, Field Museum of Natural History (Negaunee Integrative
> > Research Center)
> > Chair, Biodiversity Committee, American Society of Mammalogists
> > Taxonomy Advisor, IUCN/SSC Small Mammal Specialist Group
> >
> > personal web: n8u.org | Google Scholar | ASU profile:
> > https://isearch.asu.edu/profile/3682356
> > e: nathan.up...@asu.edu | Skype: nate_upham | Twitter: @n8_upham
> > ==============================================================================
> >
> >
> >
> >> On Jun 28, 2021, at 10:47 AM, Russell Engelman <neovenatori...@gmail.com> wrote:
> >>
> >> Dear R-Sig-Phylo Mailing List,
> >>
> >> I ran into a rather unusual problem. I was doing an analysis using the
> >> mammal trees from Upham et al. (2019) downloaded off of the VertLife site.
> >> The model statistics for my data initially suggested that the OLS model was
> >> better supported than a PGLS model based on Akaike Information Criterion
> >> (AIC). The reviewers for the paper wanted me to add more taxa, so I
> >> re-downloaded a set of trees from VertLife and reran the analysis, but when
> >> I did I found that suddenly the AIC values for the PGLS equation were
> >> dramatically different, to the point that it favored a Brownian PGLS model
> >> over all other models. This was despite the fact that previously I found
> >> that an OLS model and an OU model had a better model fit than a Brownian
> >> model, and the other accuracy statistics of interest (like percent error,
> >> this being a model intended for use in predicting new data) also found OLS
> >> and OU models to fit better than a Brownian PGLS model. The regression line
> >> for a Brownian model doesn't even fit the data at all due to being biased
> >> by a basal clade. The model also has a high amount of phylogenetic inertia
> >> which again would seemingly make an OU model a better option.
> >>
> >> I used drop.tip to remove the additional taxa to see if I could replicate
> >> my previous results, but it turns out I still couldn't replicate the
> >> results. That's when I realized what was causing the change in AIC values
> >> wasn't the taxon selection, but the tree I was using. If I used the old
> >> VertLife tree I could replicate the results, but the new VertLife tree
> >> produced radically different results despite using the same tips. So what I
> >> decided to do is rerun the analysis for all 100 trees I had available, and
> >> it turned out there was a massive amount of variation in AIC depending on
> >> what tree was chosen. I tried including an html data printout to show the
> >> precise results and how I got them, but I couldn't attach them because the
> >> mailer daemon kept saying they were too large. The AIC values between trees
> >> vary by almost 200 points after excluding extreme outliers, when model
> >> differences of 2 or more are often considered to represent statistically
> >> detectable differences. The unusually low AIC I got when I first ran the
> >> analysis happened to be because the first tree in the 100 trees merely
> >> happened to produce a lower-than-average AIC than the whole sample. The
> >> average AIC out of the 100 trees was higher than for the OLS model, which
> >> again makes sense given the distribution of the data.
> >>
> >> However, and this is where my problem comes in, how do I make appropriate
> >> model selections for PGLS if there is such a massive amount of variation in
> >> AIC? Especially given that between the trees in the sample there is enough
> >> variation that it can cause one model to be favored over another? Just
> >> picking one tree and going with that seems counterintuitive, because it's
> >> not very objective and theoretically someone could pick a specific tree to
> >> get the results they want, or accidentally pick a tree that might support
> >> the wrong model as seen here. On top of that the tree topologies are more
> >> or less identical: the same 404 taxa are present in all trees and the trees
> >> have nearly identical topologies, the only real differences between trees
> >> are branch lengths. But given this, how can I justify which AIC value I
> >> report, which in turn means which model is best supported?
> >>
> >> I did try looking at the phylo_lm function in the sensiPhy package, but
> >> that function doesn't seem to provide any method of performing model
> >> selection between different regression models. It does seemingly report
> >> AIC, but the AIC the function reported was dramatically different from the
> >> AIC I got using the gls function in ape and nlme.
> >>
> >> Sincerely,
> >> Russell