Dear R-Sig-Phylo Mailing List,

I ran into a rather unusual problem. I was doing an analysis using the
mammal trees from Upham et al. (2019), downloaded from the VertLife site.
The model statistics for my data initially suggested that an OLS model was
better supported than a PGLS model based on the Akaike Information
Criterion (AIC). The reviewers for the paper wanted me to add more taxa, so
I re-downloaded a set of trees from VertLife and reran the analysis, but
when I did I found that the AIC values for the PGLS model were suddenly,
dramatically different, to the point that a Brownian PGLS model was now
favored over all other models. This was despite the fact that I had
previously found that the OLS and OU models fit better than a Brownian
model, and that the other accuracy statistics of interest (such as percent
error, since this model is intended for predicting new data) also favored
the OLS and OU models over a Brownian PGLS model. The regression line for a
Brownian model does not fit the data at all, because it is biased by a
basal clade. There is also a high degree of phylogenetic inertia, which
again would seem to make an OU model the better option.
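
For reference, the comparison I am making is essentially the following (a
minimal sketch only; the file, data-frame, and variable names here are
placeholders rather than my actual ones, and the data frame 'dat' is
assumed to have a 'species' column matching the tree's tip labels):

library(ape)
library(nlme)

tree <- read.nexus("vertlife_tree_1.nex")   # placeholder file name
dat  <- read.csv("traits.csv")              # placeholder file name

## ML fits so the AIC values are comparable across models
fit_ols <- gls(trait ~ mass, data = dat, method = "ML")
fit_bm  <- gls(trait ~ mass, data = dat, method = "ML",
               correlation = corBrownian(phy = tree, form = ~species))
fit_ou  <- gls(trait ~ mass, data = dat, method = "ML",
               correlation = corMartins(1, phy = tree, form = ~species))

AIC(fit_ols, fit_bm, fit_ou)   # OLS vs Brownian PGLS vs OU PGLS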

I used drop.tip to remove the additional taxa to see if I could replicate
my previous results, but it turned out I still couldn't. That's when I
realized that what was causing the change in AIC values wasn't the taxon
selection but the tree I was using. If I used the old VertLife tree I could
replicate the results, but the new VertLife tree produced radically
different results despite using the same tips. So I decided to rerun the
analysis across all 100 trees I had available, and it turned out there was
a massive amount of variation in AIC depending on which tree was used. I
tried to include an HTML printout of the data to show the precise results
and how I obtained them, but I couldn't attach it because the mailer daemon
kept rejecting it as too large. The AIC values vary by almost 200 points
between trees after excluding extreme outliers, whereas AIC differences of
2 or more are often taken to indicate a meaningful difference in model
support. The unusually low AIC I got when I first ran the analysis turned
out to be because the first of the 100 trees happened to produce a
lower-than-average AIC relative to the whole sample. The average PGLS AIC
across the 100 trees was higher than the AIC for the OLS model, which again
makes sense given the distribution of the data.
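
In case it clarifies things, the per-tree rerun was along these lines
(again just a sketch, with the same placeholder data and formula as above;
'trees' is the multiPhylo set of 100 VertLife trees):

trees <- read.nexus("vertlife_100_trees.nex")   # placeholder file name

aic_bm <- sapply(trees, function(tr) {
  ## keep only the taxa present in my data set
  tr  <- drop.tip(tr, setdiff(tr$tip.label, dat$species))
  fit <- gls(trait ~ mass, data = dat, method = "ML",
             correlation = corBrownian(phy = tr, form = ~species))
  AIC(fit)
})

summary(aic_bm)   # the spread across trees is what surprised me
hist(aic_bm)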

However, and this is where my problem comes in: how do I perform
appropriate model selection for PGLS when there is this much variation in
AIC? Especially given that the variation between the trees in the sample is
large enough to flip which model is favored? Just picking one tree and
going with it seems counterintuitive, because it's not very objective:
someone could in theory pick a specific tree to get the results they want,
or accidentally pick a tree that supports the wrong model, as happened
here. On top of that, the trees are more or less identical: the same 404
taxa are present in all of them and the topologies are nearly identical;
the only real differences between trees are the branch lengths. Given this,
how can I justify which AIC value I report, and therefore which model is
best supported?

I did try looking at the phylo_lm function in the sensiPhy package, but
that function doesn't seem to provide any way of performing model selection
between different regression models. It does seemingly report AIC, but the
AIC the function reported was dramatically different from the AIC I got
using the gls function in nlme with the correlation structures from ape.
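
I may simply be misusing it, but the sort of call I was attempting looked
roughly like this (I may have the function name wrong; tree_phylm is the
tree-wise wrapper I found, and my understanding, which could be mistaken,
is that it fits models via phylolm rather than nlme::gls; the row-name
requirement below is my reading of the documentation):

library(sensiPhy)

rownames(dat) <- dat$species   # sensiPhy expects species as row names
fit_sensi <- tree_phylm(trait ~ mass, data = dat, phy = trees,
                        n.tree = 100, model = "BM")
summary(fit_sensi)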

Sincerely,
Russell
