Re: [R-sig-phylo] Model Selection and PGLS

2021-06-29 Thread Nathan Upham
Hi Russell and all, sounds good.

I’d suggest that the “null model” for fitting trait data to a phylogeny should 
be single-rate Brownian motion, i.e., you’re assuming that given data on the 
ancestor-to-descendant relationships of the species (and timing of 
divergences), and assuming the trait is heritable, it is evolving at the same 
random rate along each branch.  The burden of proof is on rejecting that null 
hypothesis (not “accepting it”— sorry for earlier writing that incorrectly!).  
So if you do your AIC fitting across the 100 trees, summarize the results, and 
find no clear signal of a model being obviously better than single-rate 
Brownian, then that is what you should use for subsequent analyses.

If anyone has a different perspective on this, please chime in.  The above 
assumes that you’ve established heritability of the trait, for example by doing 
a test for phylogenetic ’signal’.

Does that help?  All the best
—nate



> On Jun 28, 2021, at 1:25 PM, Russell Engelman  
> wrote:
> 
> Dear Dr. Upham (and All),
> 
> Please don't take my initial message the wrong way, this is not meant to be a 
> dig at your 2019 study. I don’t think this is due to the birth-death tree 
> specifically but would be present in any study where there are multiple 
> phylogenetic trees to choose from or some measure of uncertainty in the tip 
> dates. I definitely agree with you that there is almost certainly going to be 
> variation in model support values if there is any difference in the 
> underlying phylogeny, however, I was surprised that AIC would vary this much 
> in a dataset where the trait data, number of tips, and branching topology 
> used to compute the model are more or less constant between trees.
> 
> My question is more along the lines of "given that it is logical to expect 
> AIC to vary based on differences between trees, how would one go about 
> determining which regression model is the "optimal" one to use for further 
> analysis"? You mentioned taking the 95% confidence intervals of the models 
> and seeing if they don't overlap, would this be just taking the singular AIC 
> from the OLS model and comparing it to the PGLS one, since OLS seemingly 
> doesn't produce a confidence interval of AIC values? And if the confidence 
> intervals do overlap, is the OLS  or PGLS considered the null hypothesis? In 
> my case the AIC for OLS is within the 95% confidence intervals for the PGLS, 
> but is much lower than the mean value (it's close to the lower first standard 
> deviation of the AIC values).
> 
> Sincerely,
> Russell
> 
> On Mon, Jun 28, 2021 at 2:46 PM Nathan Upham  <mailto:nathan.up...@asu.edu>> wrote:
> Hi Russell and all:
> 
> I’ll respond here since the answer is related to the intended purpose of the 
> VertLife mammal trees — i.e, capturing full uncertainty in node ages and 
> phylogenetic relationships was one of the motivators for building the mammal 
> trees in the way we did.  This approach contrasts to wanting to obtain the 
> single “best tree”, since methods of phylogenetic reconstruction will always 
> just be approximations of the “true tree” anyway rather than ever being equal 
> to that tree.  To only use a single consensus tree in comparative 
> phylogenetic analyses assumes that we know the true tree, which again, we 
> don’t ever in an empirical context (only for simulations).  Those points were 
> summarized well by Huelsenbeck et al. (2000: 
> http://science.sciencemag.org/content/288/5475/2349 
> <https://urldefense.com/v3/__http://science.sciencemag.org/content/288/5475/2349__;!!IKRxdwAv5BmarQ!NHbBAhu7TSMQWrZivqX9-FnG8bIhVy_zP2y-3oDjm_A6NttbEbrXO3uQoJczwsIVKQ$>),
>  but nevertheless are still not standard practice in PCMs.
> 
> To the point of AIC varying across the 100 trees, this is to be expected.  
> Any 1 tree of 100 trees from the credible set is not very meaningful; the 
> entire 100 trees need to be analyzed and then the estimate +/- SE from each 
> tree can be summarized as a distribution of values.  If the 95% CI on the 
> distribution of values excludes your hypothesis, then you’ve learned 
> something; if not, you accept the null hypothesis.  See the animated gifs 
> here (http://vertlife.org/data/mammals/ 
> <https://urldefense.com/v3/__http://vertlife.org/data/mammals/__;!!IKRxdwAv5BmarQ!NHbBAhu7TSMQWrZivqX9-FnG8bIhVy_zP2y-3oDjm_A6NttbEbrXO3uQoJdB22hcow$>)
>  for a better conception of why this phylogenetic uncertainty is important to 
> consider when doing model fitting or other PCMs.
> 
> That said, if a single ‘best tree’ is the target, then the DNA-only MCC tree 
> of 4098 species is a reasonable thing to analyze, more analogous to how 
> mainstream phylogenetics has presented trees for re-us

Re: [R-sig-phylo] Model Selection and PGLS

2021-06-28 Thread Nathan Upham
Hi Russell and all:

I’ll respond here since the answer is related to the intended purpose of the 
VertLife mammal trees — i.e, capturing full uncertainty in node ages and 
phylogenetic relationships was one of the motivators for building the mammal 
trees in the way we did.  This approach contrasts to wanting to obtain the 
single “best tree”, since methods of phylogenetic reconstruction will always 
just be approximations of the “true tree” anyway rather than ever being equal 
to that tree.  To only use a single consensus tree in comparative phylogenetic 
analyses assumes that we know the true tree, which again, we don’t ever in an 
empirical context (only for simulations).  Those points were summarized well by 
Huelsenbeck et al. (2000: http://science.sciencemag.org/content/288/5475/2349), 
but nevertheless are still not standard practice in PCMs.

To the point of AIC varying across the 100 trees, this is to be expected.  Any 
1 tree of 100 trees from the credible set is not very meaningful; the entire 
100 trees need to be analyzed and then the estimate +/- SE from each tree can 
be summarized as a distribution of values.  If the 95% CI on the distribution 
of values excludes your hypothesis, then you’ve learned something; if not, you 
accept the null hypothesis.  See the animated gifs here 
(http://vertlife.org/data/mammals/) for a better conception of why this 
phylogenetic uncertainty is important to consider when doing model fitting or 
other PCMs.

That said, if a single ‘best tree’ is the target, then the DNA-only MCC tree of 
4098 species is a reasonable thing to analyze, more analogous to how mainstream 
phylogenetics has presented trees for re-use 
(https://github.com/n8upham/MamPhy_v1/blob/master/_DATA/MamPhy_fullPosterior_BDvr_DNAonly_4098sp_topoFree_NDexp_MCC_v2_target.tre).
  But again, while the MCC tree is appropriate, 1 of 100 trees from the 
credible set is not.

Hope that helps.  All the best,
—nate



==
Nathan S. Upham, Ph.D. (he/him)
Assistant Research Professor & Associate Curator of Mammals
Arizona State University, School of Life Sciences, Biodiversity Knowledge 
Integration Center (BioKIC )
 ~> Check out the new Mammal Tree of Life 
 and the Mammal Diversity Database 


Research Associate, Yale University (Ecology and Evolutionary Biology)
Research Associate, Field Museum of Natural History (Negaunee Integrative 
Research Center)
Chair, Biodiversity Committee, American Society of Mammalogists
Taxonomy Advisor, IUCN/SSC Small Mammal Specialist Group

personal web: n8u.org | Google Scholar 

 | ASU profile 
e: nathan.up...@asu.edu | Skype: nate_upham | Twitter: @n8_upham 
 
=



> On Jun 28, 2021, at 10:47 AM, Russell Engelman  
> wrote:
> 
> Dear R-Sig-Phylo Mailing List,
> 
> I ran into a rather unusual problem. I was doing an analysis using the
> mammal trees from Upham et al. (2019) downloaded off of the VertLife site.
> The model statistics for my data initially suggested that the OLS model was
> better supported than a PGLS model based on Akaike Information Criterion
> (AIC). The reviewers for the paper wanted me to add more taxa, so I
> re-downloaded a set of trees from VertLife and reran the analysis, but when
> I did I found that suddenly the AIC values for the PGLS equation were
> dramatically different, to the point that it favored a Brownian PGLS model
> over all other models. This was despite the fact that previously I found
> that an OLS model and an OU model had a better model fit than a Brownian
> model, and the other accuracy statistics of interest (like percent error,
> this being a model intended for use in predicting new data) also found OLS
> and OU models to fit better than a Brownian PGLS model. The regression line
> for a Brownian model doesn't even fit the data at all due to being biased
> by a basal clade. The model also has a high amount of phylogenetic inertia
> which again would seemingly make an OU model a better option.
> 
> I used drop.tip to remove the additional taxa to see if I could replicate
> my previous results, but it turns out I still couldn't replicate the
> results. That's when I realized what was causing the change in AIC values
> wasn't the taxon selection, but the tree I was using. If I used the old
> VertLife tree I could replicate the results, but the new VertLife tree
> produced radically different results despite using the same tips. So what I
> decided to do is rerun the analysis for all 100 trees I had available, and
> it turned 

Re: [R-sig-phylo] Using bind.tip and lapply on a multiPhylo object

2021-06-02 Thread Nathan Upham
Hi Russell:

Glad to hear you’re using the VertLife mammal trees — they are built on a 
taxonomy of 5,911 species of which only 4,098 are sampled for DNA, so there is 
already a ~30% chunk that is placed using taxonomic constraints and birth-death 
branch lengths as sampled during the estimation of 28 Bayesian patch clades.  

Adding additional species described since the 2015 cutoff of that VertLife 
taxonomy makes sense (e.g., up to ~6,500 species on mammaldiversity.org).  
However, keep in mind that they will not have birth-death estimated branch 
lengths, but rather more likely be added as a polygamy to given clade and then 
randomly resolved.

Given the sample code you provided, the key thing you’ll want to do is run a 
*loop* rather than using lapply, so that you can specify a given tree each 
time, e.g.:

newtrees<-vector(“list”,length(trees))
for(j in 1:length(trees)){
newtrees[[j]] <- bind.tip(tree=trees[[j]], 
tip.label="Cercopithecus_albogularis”, position=0.59,edge.length = 0.59, 
where=mrca(tree[[j]])["Cercopithecus_mitis","Cercopithecus_mitis"])
}

I also wrote some code to prune mammal trees and add extinct Caribbean species, 
which uses a similar approach of making polytomies and randomly resolving them 
— here is the repo:
https://github.com/n8upham/CaribbeanExtinctions-WTWTW/tree/master/mamPhy_pruningCode
And here is the code file:
https://github.com/n8upham/CaribbeanExtinctions-WTWTW/blob/master/mamPhy_pruningCode/pruningCode_MamPhy-to-CaribbeanTaxa.R

Hope that helps,
—nate




Nathan S. Upham, Ph.D. (he/him)
Assistant Research Professor & Associate Curator of Mammals
Arizona State University, School of Life Sciences
 ~> Check out the new Mammal Tree of Life 
 and the Mammal Diversity Database 


Research Associate, Yale University (Ecology and Evolutionary Biology)
Research Associate, Field Museum of Natural History (Negaunee Integrative 
Research Center)
Chair, Biodiversity Committee, American Society of Mammalogists
Taxonomy Advisor, IUCN/SSC Small Mammal Specialist Group

personal web: n8u.org | Google Scholar 

 | ASU profile 
e: nathan.up...@asu.edu | Skype: nate_upham | Twitter: @n8_upham 
 




> On Jun 2, 2021, at 4:19 PM, Eliot Miller  wrote:
> 
> Hi Russell,
> 
> A package I wrote a while back should be able to do that fairly easily.
> https://urldefense.com/v3/__https://github.com/eliotmiller/addTaxa__;!!IKRxdwAv5BmarQ!OZj7-dFRbxvUothKjSj6hr9B0eXscAO6LVWi1-a0w3J_PxlDqvsFDNb0lQrzxl2aIw$
>   The only paper it's described in
> remains 
> https://urldefense.com/v3/__https://bsapubs.onlinelibrary.wiley.com/doi/full/10.3732/ajb.1500195__;!!IKRxdwAv5BmarQ!OZj7-dFRbxvUothKjSj6hr9B0eXscAO6LVWi1-a0w3J_PxlDqvsFDNb0lQp7PnRRHg$
>  
> It's a wrapper for bind.tip, with some additional stuff. You basically
> would give it a taxonomic file where you identify the clades you're
> interested in (e.g. both of those Cercopithecus species could be named some
> unique clade name and off you go, it'd add the missing one to the other),
> then lapply that whole addTaxa command over the list of trees in
> multiPhylo. At some point I made laser a dependency, and it's possible I
> left it in that state. If that's the case, you can still get laser from old
> CRAN mirrors I believe. Let me know if you want more help.
> 
> Best,
> Eliot
> 
> On Wed, Jun 2, 2021 at 7:05 PM Russell Engelman 
> wrote:
> 
>> Dear R-sig-phylo,
>> 
>> I have been working with a mammalian phylogeny I recently downloaded from
>> VertLife 
>> (https://urldefense.com/v3/__http://vertlife.org/phylosubsets/__;!!IKRxdwAv5BmarQ!OZj7-dFRbxvUothKjSj6hr9B0eXscAO6LVWi1-a0w3J_PxlDqvsFDNb0lQoEdEnMAg$
>>  ). Unfortunately, the phylogeny
>> is missing a large number of species, so I am trying to manually add these
>> taxa to the phylogeny. I have a series of 100 trees that I am using to do
>> things such as test for phylogenetic signal. I know how to use bind.tip to
>> add new taxa to a single tree, but I am having more trouble with a
>> multiPhylo object. I am primarily adding these taxa by placing them as
>> sister to their nearest included relative (since most of them are elevated
>> former subspecies), but the issue here is that in the 100 trees in the
>> multiPhylo object the node representing the taxon to bind these taxa to is
>> not the same across all trees due to shifting topologies.
>> 
>> This is an example of the code I have been using, in which "tree" is the
>> tree object. This works for a single 'phylo' tree but not 'multiphylo'.
>> 
>> ```
>>