Re: [R-sig-phylo] Model Selection and PGLS

Theodore Garland Tue, 29 Jun 2021 14:39:53 -0700

The other possible null model would be a "star" phylogeny with no
hierarchical structure, equal-length branches, and also Brownian motion.
But that's generally viewed as outside of the range of reasonable
possibilities.
Cheers,
Ted
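
One way to see why the star phylogeny is the other natural null: under Brownian motion the expected covariance between two tips is proportional to the branch length they share, so a star tree with equal branch lengths gives a covariance matrix proportional to the identity, and PGLS then returns the same coefficients as OLS. Below is a minimal sketch of that equivalence in plain Python. The four-tip data, the two covariance matrices, and the helper names `solve` and `gls` are all made up for illustration; they are not from any package discussed in this thread.

```python
from fractions import Fraction


def solve(a, b):
    """Solve a @ x = b exactly by Gauss-Jordan elimination (a is n x n, b is n x m)."""
    n = len(a)
    m = [[Fraction(v) for v in ra] + [Fraction(v) for v in rb] for ra, rb in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        scale = m[col][col]
        m[col] = [v / scale for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [row[n:] for row in m]


def gls(x, y, v):
    """Return (intercept, slope) of y ~ x with error covariance proportional to v."""
    n = len(x)
    design = [[1, xi] for xi in x]
    vinv_x = solve(v, design)               # V^-1 X
    vinv_y = solve(v, [[yi] for yi in y])   # V^-1 y
    cols = list(zip(*design))               # columns of X, i.e. rows of X^T
    xtvx = [[sum(c[i] * vinv_x[i][j] for i in range(n)) for j in range(2)] for c in cols]
    xtvy = [[sum(c[i] * vinv_y[i][0] for i in range(n))] for c in cols]
    beta = solve(xtvx, xtvy)
    return beta[0][0], beta[1][0]


# Hypothetical data for four tips; tree depth is 10 time units in both trees.
x = [1, 2, 3, 4]
y = [2, 3, 7, 8]
identity = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
star = [[10 if i == j else 0 for j in range(4)] for i in range(4)]
# ((A,B),(C,D)): each pair of sister tips shares 6 of the 10 time units.
nested = [[10, 6, 0, 0], [6, 10, 0, 0], [0, 0, 10, 6], [0, 0, 6, 10]]

ols_fit = gls(x, y, identity)
star_fit = gls(x, y, star)
nested_fit = gls(x, y, nested)
print("OLS:   ", ols_fit)
print("star:  ", star_fit)
print("nested:", nested_fit)
```

The star and OLS fits coincide exactly, while the nested tree gives a different slope. In that sense, comparing OLS against Brownian PGLS is already a comparison between the star null and the hierarchical one.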


On Tue, Jun 29, 2021 at 12:05 PM Nathan Upham <nathan.up...@asu.edu> wrote:

> Hi Russell and all, sounds good.
>
> I’d suggest that the “null model” for fitting trait data to a phylogeny
> should be single-rate Brownian motion, i.e., you’re assuming that given
> data on the ancestor-to-descendant relationships of the species (and timing
> of divergences), and assuming the trait is heritable, it is evolving at the
> same random rate along each branch.  The burden of proof is on rejecting
> that null hypothesis (not “accepting it”— sorry for earlier writing that
> incorrectly!).  So if you do your AIC fitting across the 100 trees,
> summarize the results, and find no clear signal of a model being obviously
> better than single-rate Brownian, then that is what you should use for
> subsequent analyses.
>
> If anyone has a different perspective on this, please chime in.  The above
> assumes that you’ve established heritability of the trait, for example by
> doing a test for phylogenetic ‘signal’.
>
> Does that help?  All the best
> —nate
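
A minimal sketch of "do your AIC fitting across the 100 trees, summarize the results", assuming the two models have already been fitted on every tree and their AIC values stored in tree order. The function name `summarize_delta_aic`, the threshold of 2, and the five pairs of AIC values are illustrative assumptions, not values from this thread. The comparison is paired: both models are scored on the same tree before anything is summarized, so tree-to-tree shifts in the overall AIC level cancel out.

```python
from statistics import median


def summarize_delta_aic(aic_null, aic_alt, threshold=2.0):
    """Compare two models tree by tree using paired AIC values.

    delta = AIC(alternative) - AIC(null) on the same tree, so negative
    values favor the alternative model.
    """
    if len(aic_null) != len(aic_alt):
        raise ValueError("need one AIC per model per tree")
    deltas = [alt - null for null, alt in zip(aic_null, aic_alt)]
    n = len(deltas)
    return {
        "median_delta": median(deltas),
        "frac_alt_clearly_better": sum(d < -threshold for d in deltas) / n,
        "frac_null_clearly_better": sum(d > threshold for d in deltas) / n,
    }


# Hypothetical AIC values for five trees (not from the thread).
aic_bm = [100.0, 102.0, 98.0, 101.0, 99.0]   # single-rate Brownian (null)
aic_ou = [99.0, 103.0, 95.0, 100.5, 99.0]    # alternative model
summary = summarize_delta_aic(aic_bm, aic_ou)
print(summary)
```

If the alternative is clearly better on only a small fraction of trees, that would be the "no clear signal" situation described above, and single-rate Brownian would be retained.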
>
>
>
> > On Jun 28, 2021, at 1:25 PM, Russell Engelman <neovenatori...@gmail.com> wrote:
> >
> > Dear Dr. Upham (and All),
> >
> > Please don't take my initial message the wrong way; it was not meant as
> > a dig at your 2019 study. I don't think this is due to the birth-death
> > tree specifically; it would be present in any study where there are
> > multiple phylogenetic trees to choose from or some measure of uncertainty
> > in the tip dates. I definitely agree with you that there will almost
> > certainly be variation in model support values if there is any
> > difference in the underlying phylogeny. However, I was surprised that AIC
> > would vary this much in a dataset where the trait data, number of tips,
> > and branching topology used to compute the model are more or less
> > constant between trees.
> >
> > My question is more along the lines of: given that it is logical to
> > expect AIC to vary based on differences between trees, how would one go
> > about determining which regression model is the "optimal" one to use for
> > further analysis? You mentioned taking the 95% confidence intervals of
> > the models and seeing whether they overlap. Would this just mean taking
> > the single AIC from the OLS model and comparing it to the PGLS interval,
> > since OLS seemingly doesn't produce a confidence interval of AIC values?
> > And if the confidence intervals do overlap, is OLS or PGLS considered
> > the null hypothesis? In my case the AIC for OLS is within the 95%
> > confidence interval for the PGLS, but is much lower than the mean value
> > (it's close to one standard deviation below the mean of the AIC values).
> >
> > Sincerely,
> > Russell
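
Since OLS ignores the tree, it yields a single AIC, while PGLS yields one AIC per tree. A rough sketch of how one might describe where that single value sits within the PGLS distribution is below. The helper `locate_in_distribution` and all numbers are hypothetical, and the 95% interval here is a plain percentile interval, which may differ from the interval meant in the message above.

```python
from statistics import mean, quantiles, stdev


def locate_in_distribution(value, sample):
    """Describe where a single value falls within a sample of values."""
    cuts = quantiles(sample, n=40, method="inclusive")  # cut points every 2.5%
    lower, upper = cuts[0], cuts[-1]                    # 2.5% and 97.5% points
    return {
        "mean": mean(sample),
        "interval_95": (lower, upper),
        "inside_interval": lower <= value <= upper,
        "z_score": (value - mean(sample)) / stdev(sample),
        "frac_sample_below": sum(s < value for s in sample) / len(sample),
    }


# Hypothetical numbers: PGLS AIC on 100 trees and one OLS AIC.
pgls_aic = [float(a) for a in range(100, 200)]
ols_aic = 120.5
position = locate_in_distribution(ols_aic, pgls_aic)
print(position)
```

In this made-up case the OLS value lies inside the interval and about one standard deviation below the mean, which mirrors the situation described above. The `frac_sample_below` entry gives the share of trees on which PGLS still scores lower than OLS.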
> >
> > On Mon, Jun 28, 2021 at 2:46 PM Nathan Upham <nathan.up...@asu.edu> wrote:
> > Hi Russell and all:
> >
> > I’ll respond here since the answer is related to the intended purpose of
> > the VertLife mammal trees: capturing full uncertainty in node ages and
> > phylogenetic relationships was one of the motivators for building the
> > mammal trees in the way we did.  This approach contrasts with wanting to
> > obtain the single “best tree”, since methods of phylogenetic
> > reconstruction will always be approximations of the “true tree” rather
> > than ever being equal to it.  Using only a single consensus tree in
> > comparative phylogenetic analyses assumes that we know the true tree,
> > which, again, we never do in an empirical context (only in simulations).
> > Those points were summarized well by Huelsenbeck et al. (2000:
> > http://science.sciencemag.org/content/288/5475/2349), but accounting for
> > them is still not standard practice in PCMs.
> >
> > To the point of AIC varying across the 100 trees, this is to be
> > expected.  Any 1 tree of 100 trees from the credible set is not very
> > meaningful; the entire 100 trees need to be analyzed and then the
> > estimate +/- SE from each tree can be summarized as a distribution of
> > values.  If the 95% CI on the distribution of values excludes your
> > hypothesis, then you’ve learned something; if not, you accept the null
> > hypothesis.  See the animated gifs here
> > (http://vertlife.org/data/mammals/) for a better conception of why this
> > phylogenetic uncertainty is important to consider when doing model
> > fitting or other PCMs.
> >
> > That said, if a single ‘best tree’ is the target, then the DNA-only MCC
> > tree of 4098 species is a reasonable thing to analyze, more analogous to
> > how mainstream phylogenetics has presented trees for re-use
> > (https://github.com/n8upham/MamPhy_v1/blob/master/_DATA/MamPhy_fullPosterior_BDvr_DNAonly_4098sp_topoFree_NDexp_MCC_v2_target.tre).
> > But again, while the MCC tree is appropriate, 1 of 100 trees from the
> > credible set is not.
> >
> > Hope that helps.  All the best,
> > —nate
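
One possible way to turn "the estimate +/- SE from each tree" into a single summary is to pool across trees the way multiple-imputation analyses pool across imputed datasets (Rubin's rules). This is a swapped-in technique offered as a sketch, not necessarily the summary intended in the message above. The function name `pool_across_trees` and the three estimates are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def pool_across_trees(estimates, std_errors):
    """Pool one coefficient estimated on m trees (Rubin's rules).

    total variance = mean within-tree variance
                     + (1 + 1/m) * between-tree variance of the estimates
    """
    m = len(estimates)
    if m < 2 or m != len(std_errors):
        raise ValueError("need matching estimates and SEs from at least two trees")
    pooled = mean(estimates)
    within = mean([se ** 2 for se in std_errors])
    between = variance(estimates)          # sample variance across trees
    total = within + (1 + 1 / m) * between
    return pooled, sqrt(total)


# Hypothetical slope estimates and standard errors from three trees.
slopes = [0.70, 0.75, 0.80]
slope_ses = [0.10, 0.10, 0.10]
pooled_slope, pooled_se = pool_across_trees(slopes, slope_ses)
print(pooled_slope, pooled_se)
```

The pooled standard error is larger than any single-tree SE because it also carries the disagreement between trees. That added spread is the phylogenetic uncertainty that a single-tree analysis leaves out.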
> >
> >
> >
> >
> > ==============================================================================
> > Nathan S. Upham, Ph.D. (he/him)
> > Assistant Research Professor & Associate Curator of Mammals
> > Arizona State University, School of Life Sciences, Biodiversity
> > Knowledge Integration Center (BioKIC <https://biokic.asu.edu/>)
> >      ~> Check out the new Mammal Tree of Life
> >         <http://vertlife.org/data/mammals/> and the Mammal Diversity
> >         Database <https://mammaldiversity.org/>
> >
> > Research Associate, Yale University (Ecology and Evolutionary Biology)
> > Research Associate, Field Museum of Natural History (Negaunee
> > Integrative Research Center)
> > Chair, Biodiversity Committee, American Society of Mammalogists
> > Taxonomy Advisor, IUCN/SSC Small Mammal Specialist Group
> >
> > personal web: n8u.org <http://n8u.org>
> > | Google Scholar <https://scholar.google.com/citations?hl=en&user=zIn4NoUAAAAJ&view_op=list_works&gmla=AJsN-F6ybkfthmTdjTpow6sgMhWKn1EKcfNtmIF_wzZcev7yeHuEu5_aolFS85rWiVRHpiQgbwg43i6eS6kArrabLdFL4bntzUSRmlRP2CW4lbZqeEcColw>
> > | ASU profile <https://isearch.asu.edu/profile/3682356>
> > e: nathan.up...@asu.edu | Skype: nate_upham | Twitter: @n8_upham
> >   <https://twitter.com/n8_upham>
> > =============================================================================
> >
> >
> >
> >> On Jun 28, 2021, at 10:47 AM, Russell Engelman <neovenatori...@gmail.com> wrote:
> >>
> >> Dear R-Sig-Phylo Mailing List,
> >>
> >> I ran into a rather unusual problem. I was doing an analysis using the
> >> mammal trees from Upham et al. (2019) downloaded from the VertLife site.
> >> The model statistics for my data initially suggested that the OLS model
> >> was better supported than a PGLS model based on the Akaike Information
> >> Criterion (AIC). The reviewers for the paper wanted me to add more taxa,
> >> so I re-downloaded a set of trees from VertLife and reran the analysis,
> >> but when I did I found that the AIC values for the PGLS equation were
> >> suddenly dramatically different, to the point that AIC favored a
> >> Brownian PGLS model over all other models. This was despite the fact
> >> that I had previously found that an OLS model and an OU model had a
> >> better fit than a Brownian model, and that the other accuracy statistics
> >> of interest (like percent error, this being a model intended for use in
> >> predicting new data) also found OLS and OU models to fit better than a
> >> Brownian PGLS model. The regression line for a Brownian model doesn't
> >> even fit the data at all due to being biased by a basal clade. The model
> >> also has a high amount of phylogenetic inertia, which again would
> >> seemingly make an OU model a better option.
> >>
> >> I used drop.tip to remove the additional taxa to see if I could
> >> replicate my previous results, but I still couldn't. That's when I
> >> realized that what was causing the change in AIC values wasn't the taxon
> >> selection, but the tree I was using. If I used the old VertLife tree I
> >> could replicate the results, but the new VertLife tree produced
> >> radically different results despite using the same tips. So I decided to
> >> rerun the analysis for all 100 trees I had available, and it turned out
> >> there was a massive amount of variation in AIC depending on which tree
> >> was chosen. I tried including an HTML data printout to show the precise
> >> results and how I got them, but I couldn't attach it because the mailer
> >> daemon kept saying it was too large. The AIC values between trees vary
> >> by almost 200 points after excluding extreme outliers, when model
> >> differences of 2 or more are often considered to represent statistically
> >> detectable differences. The unusually low AIC I got when I first ran the
> >> analysis arose because the first of the 100 trees happened to produce an
> >> AIC lower than the average for the whole sample. The average AIC across
> >> the 100 trees was higher than for the OLS model, which again makes sense
> >> given the distribution of the data.
> >>
> >> However, and this is where my problem comes in, how do I make
> >> appropriate model selections for PGLS if there is such a massive amount
> >> of variation in AIC? Especially given that between the trees in the
> >> sample there is enough variation that it can cause one model to be
> >> favored over another? Just picking one tree and going with that seems
> >> counterintuitive, because it's not very objective and theoretically
> >> someone could pick a specific tree to get the results they want, or
> >> accidentally pick a tree that supports the wrong model, as seen here.
> >> On top of that, the tree topologies are more or less identical: the same
> >> 404 taxa are present in all trees and the trees have nearly identical
> >> topologies; the only real differences between trees are branch lengths.
> >> But given this, how can I justify which AIC value I report, which in
> >> turn determines which model is best supported?
> >>
> >> I did try looking at the phylo_lm function in the sensiPhy package, but
> >> that function doesn't seem to provide any method of performing model
> >> selection between different regression models. It does seemingly report
> >> AIC, but the AIC the function reported was dramatically different from
> >> the AIC I got using the gls function in ape and nlme.
> >>
> >> Sincerely,
> >> Russell
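
For reference, the "differences of 2 or more" rule of thumb refers to differences between models fitted to the same data and the same tree. A small sketch of how such differences translate into Akaike weights on one tree follows. The AIC values and the function name `akaike_weights` are hypothetical.

```python
from math import exp


def akaike_weights(aic_by_model):
    """Return (delta AIC, Akaike weight) per model for one tree.

    delta_i = AIC_i - min(AIC); weight_i = exp(-delta_i / 2) / sum_j exp(-delta_j / 2)
    """
    best = min(aic_by_model.values())
    deltas = {name: aic - best for name, aic in aic_by_model.items()}
    relative = {name: exp(-d / 2) for name, d in deltas.items()}
    total = sum(relative.values())
    weights = {name: r / total for name, r in relative.items()}
    return deltas, weights


# Hypothetical AIC values for three models fitted on the same tree.
deltas, weights = akaike_weights({"OLS": 100.0, "BM": 102.0, "OU": 110.0})
for name in deltas:
    print(name, deltas[name], round(weights[name], 3))
```

Because the deltas are taken within a tree, a 200-point spread in raw AIC between trees does not by itself say which model is favored. What matters is how the ordering and the deltas behave from tree to tree.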
> >>
> >> _______________________________________________
> >> R-sig-phylo mailing list - R-sig-phylo@r-project.org
> >> https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
> >> Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/
> >
>
>
