Re: [R-sig-phylo] Model Selection and PGLS

Simone Blomberg Sun, 04 Jul 2021 10:20:54 -0700

I agree. Ted and Tony's paper shows exactly how to get phylogenetically
informed predictions for new species, conditional on the values for the
explanatory variables and the phylogenetic relationship of the "new"
species to the rest of the species in the dataset. It is there in the
Appendices. (I really love this paper!) Take-home message: You can either
re-root the tree to predict the value for the new species at the root in an
independent contrasts situation (Appendix A), or you can use GLS if you
know the phylogenetic relationships of the new species with the old
(Appendix B). Of course, PIC and GLS are equivalent under a Brownian motion
model of evolution (Blomberg et al. 2012,
https://doi.org/10.1093/sysbio/syr118).
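
The GLS route can be sketched in a few lines. This is only a minimal
illustration under assumptions that are not in the paper or in this thread:
a made-up tree of four species plus one new species, made-up trait values, a
Brownian motion covariance matrix C built from shared branch lengths, and
hypothetical helper names (solve, gls_fit, predict_blup). The prediction
for the new species is its value on the root line plus c' C^-1 (y - X beta),
where c holds the branch length it shares with each species in the dataset.

```python
# Sketch of phylogenetic (BLUP) prediction under Brownian motion.
# The toy tree, the data, and the helper names are illustrative assumptions.

def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def gls_fit(X, y, C):
    """GLS coefficients: (X' C^-1 X)^-1 X' C^-1 y."""
    p = len(X[0])
    cols = [[row[j] for row in X] for j in range(p)]
    Ci_cols = [solve(C, col) for col in cols]  # C^-1 X, column by column
    Ci_y = solve(C, y)                         # C^-1 y
    XtCiX = [[dot(cols[i], Ci_cols[j]) for j in range(p)] for i in range(p)]
    XtCiy = [dot(cols[i], Ci_y) for i in range(p)]
    return solve(XtCiX, XtCiy)

def predict_blup(x_new, c_new, X, y, C, beta):
    """Return (root-line prediction, phylogenetically informed prediction)."""
    resid = [yi - dot(xi, beta) for xi, yi in zip(X, y)]
    root_line = dot(x_new, beta)
    return root_line, root_line + dot(c_new, solve(C, resid))

# Toy tree ((A:1,B:1):1,(C:1,D:1):1); C[i][j] = branch length shared from the root.
C = [[2.0, 1.0, 0.0, 0.0],
     [1.0, 2.0, 0.0, 0.0],
     [0.0, 0.0, 2.0, 1.0],
     [0.0, 0.0, 1.0, 2.0]]
X = [[1.0, 0.5], [1.0, 1.0], [1.0, 1.5], [1.0, 2.2]]  # intercept, predictor
y = [1.9, 2.4, 1.1, 1.8]                               # made-up response

beta = gls_fit(X, y, C)

# New species E, sister to A, splitting from A 0.5 units before the present:
# it shares 1.5 units of history with A, 1.0 with B, and none with C or D.
c_E = [1.5, 1.0, 0.0, 0.0]
root_pred, blup_pred = predict_blup([1.0, 0.8], c_E, X, y, C, beta)
print("root-line prediction:", root_pred)
print("phylogenetic prediction:", blup_pred)
```

If c_E is set to all zeros (a species sharing no history with the dataset
beyond the root), the adjustment term vanishes and the two predictions
coincide, so the root line is what remains when placement is unknown.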


Cheers,

Simone.

On 1/7/21 3:09 pm, Theodore Garland wrote:

Russell,
Please read this paper:
https://pubmed.ncbi.nlm.nih.gov/10718731/
Cheers
Ted


On Wed, Jun 30, 2021, 9:21 PM Russell Engelman <neovenatori...@gmail.com>
wrote:

Dear All,

What you see is the large uncertainty in “ancestral” states, which is part
of the intercept here. The linear relationship that you overlaid on top of
your data is the relationship predicted at the root of the tree (as if such
a thing existed!). There is a lot of uncertainty about the intercept, but
much less uncertainty in the slope. It looks like the slope is not affected
by the inclusion or exclusion of monotremes. (for one possible reference on
the greater precision in the slope versus the intercept, there’s this:
http://dx.doi.org/10.1214/13-AOS1105 for the BM).


Yes, that sounds right from the other data I have. The line approximates
what would be expected for the root of Mammalia, and the signal in the PGLS
is more due to shifts in the y-intercept than shifts in slope, which in
turn is supported by the anatomy of the proxy.

My second cent is that the phylogenetic predictions should be stable. The
uncertainty in the intercept (and the large effect of including monotremes
on the intercept) should not affect predictions, so long as you know for
which species you want to make a prediction. If you want to make a
prediction for a species in a small clade “far” from monotremes, say, then the
prediction is probably quite stable, even if you include monotremes: this
is because the phylogenetic prediction should use the phylogenetic
relationships for the species to be predicted. A prediction that uses the
linear relationship at the root and ignores the placement of the species
would be the worst-case scenario: for a mammal species with a completely
unknown placement within mammals.


This is what I'm a bit confused about. I was always told (and it seemingly
implies this in some of the PGLS literature I read like Rohlf 2011 and
Smaers and Rohlf 2016) that it isn't possible to include phylogenetic data
from the new data points into the prediction in order to improve
predictions. I'm a little confused as to whether it's possible or not (see
below).

There are probably a number of software packages that do phylogenetic
prediction. I know of Rphylopars and PhyloNetworks.


I will take a look into those.

I think that Cécile's and Theodore's point is important and too often
overlooked. Using GLS models, the BLUP (Best Linear Unbiased Prediction) is
not simply obtained from the fitted line but should incorporate
information from the (evolutionary here) model.


      There’s a way to impute phylogenetic signal back into a PGLS model? I
am super surprised at that. I’ve talked to at least three different
colleagues who use PGLS about this issue, and all of them had told me that
there is no way to input phylogenetic signal back into the model for new
data points and I should just go with the single regression line the model
gives me (i.e., the regression line for the ancestral node).

      I tried looking around to see what previous researchers used when
using PCM on body mass (Esteban-Trivigno and Köhler 2011, Campione and
Evans 2012, Yapuncich 2017 thesis) and it looks like all of them just went
with the best fit line with the ancestral node, i.e., looking at their
reported results they give a simple trait~predictor equation that does not
include phylogeny when calculating new data. Campione and Evans 2012 used
PIC versus PGLS, which I know are technically equivalent but it doesn't
seem like they included phylogenetic information when they predicted new
data: they used their equations on dinosaurs but there are no dinosaurs in
the tree they used. I know that it’s possible to incorporate phylogenetic
signal into the new data using PVR but PVR has been criticized for other
reasons.

     This is something that seems really, really concerning because if
there is a method of using phylogenetic covariance to adjust the position
of new data points it seems like a lot of workers don’t know these methods
exist, to the point that even published papers overlook it. This was
something I was hoping to highlight in a later paper on the data, but it
sounds like people might have discussed it already. I remember talking with
my colleagues a lot about "isn't there some way to incorporate phylogenetic
information back into the model to improve accuracy of the prediction if we
know where the taxon is positioned?" and they just thought there wasn't a
way.

Regarding the model comparison, I would simply avoid it (or limit it) by
fitting models flexible enough to span the range between your BM and OLS
cases and summarizing the results obtained across all the trees…


I am not entirely sure what is meant here. Do you mean fitting both an OLS
and BM model and comparing both models? I am reporting both, but my concern
is about which model I report is the best one to use going forward, since
the BM model is seemingly less accurate (though I am just taking the fitted
values from the PGLS model, which I don't think include phylogenetic
information). The two models I use produce dramatically different results,
for example the BM model produces body mass estimates which are 25% larger
than OLS.
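
To put the size of that gap in perspective: on a log scale, a 25% difference
in back-transformed mass corresponds to a small vertical offset between the
two lines. A quick check, assuming the masses were log10-transformed (the
base is an assumption, and the helper names are hypothetical):

```python
import math

def ratio_from_log_offset(offset, base=10.0):
    """Ratio of back-transformed predictions for a given log-scale offset."""
    return base ** offset

def log_offset_from_ratio(ratio, base=10.0):
    """Offset between two log-scale predictions that yields the given ratio."""
    return math.log(ratio, base)

# Offset between two regression lines that makes one estimate 25% larger.
offset = log_offset_from_ratio(1.25)
print(f"log10 offset for a 25% larger estimate: {offset:.4f}")
print(f"ratio recovered from that offset: {ratio_from_log_offset(offset):.2f}")
```

An offset of that size is easy to miss on a log-log plot, which is why
small shifts in the intercept matter so much after back-transformation.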

Right now PGLS is something I would avoid if I had the option (if for no
other reason than to avoid putting all of the analyses in a single,
overloaded manuscript [the manuscript is already about 90 pages] and
deviating from the scope of the study), but I'm sure you know that most
regression analyses nowadays require some sort of preliminary PCM to be
acceptable.

Sincerely,
Russell

On Wed, Jun 30, 2021 at 10:24 AM Julien Clavel <julien.cla...@hotmail.fr>
wrote:

I think that Cécile's and Theodore's point is important and too often
overlooked. Using GLS models, the BLUP (Best Linear Unbiased Prediction) is
not simply obtained from the fitted line but should incorporate
information from the (evolutionary here) model.

For multivariate linear models you can also do it by specifying a tree
including both the species used to build the model and the ones you want to
predict using the “predict” function in mvMORPH (I think that Rphylopars
can deal with multivariate phylogenetic regression too).

Regarding the model comparison, I would simply avoid it (or limit it) by
fitting models flexible enough to span the range between your BM and OLS
cases and summarizing the results obtained across all the trees…

Julien


From: R-sig-phylo <r-sig-phylo-boun...@r-project.org> on behalf of
Theodore Garland <theodore.garl...@ucr.edu>
Sent: Wednesday, 30 June 2021 03:26
To: Cecile Ane <cecile....@wisc.edu>
Cc: mailman, r-sig-phylo <r-sig-phylo@r-project.org>;
neovenatori...@gmail.com <neovenatori...@gmail.com>
Subject: Re: [R-sig-phylo] Model Selection and PGLS

All true.  I would just add two things.  First, always graph your data and
do ordinary OLS analyses as a reality check.

Second, I think this is the original paper for phylogenetic prediction:
Garland, Jr., T., and A. R. Ives. 2000. Using the past to predict the
present: confidence intervals for regression equations in phylogenetic
comparative methods. American Naturalist 155:346–364.
There, we talk about the Equivalency of the Independent-Contrasts and
Generalized Least Squares Approaches.

Cheers,
Ted


On Tue, Jun 29, 2021 at 5:01 PM Cecile Ane <cecile....@wisc.edu> wrote:

Hi Russell,

What you see is the large uncertainty in “ancestral” states, which is part
of the intercept here. The linear relationship that you overlaid on top of
your data is the relationship predicted at the root of the tree (as if such
a thing existed!). There is a lot of uncertainty about the intercept, but
much less uncertainty in the slope. It looks like the slope is not affected
by the inclusion or exclusion of monotremes. (for one possible reference on
the greater precision in the slope versus the intercept, there’s this:
http://dx.doi.org/10.1214/13-AOS1105 for the BM).

My second cent is that the phylogenetic predictions should be stable. The
uncertainty in the intercept (and the large effect of including monotremes
on the intercept) should not affect predictions, so long as you know for
which species you want to make a prediction. If you want to make a
prediction for a species in a small clade “far” from monotremes, say, then
the prediction is probably quite stable, even if you include monotremes:
this is because the phylogenetic prediction should use the phylogenetic
relationships for the species to be predicted. A prediction that uses the
linear relationship at the root and ignores the placement of the species
would be the worst-case scenario: for a mammal species with a completely
unknown placement within mammals.

There are probably a number of software packages that do phylogenetic
prediction. I know of Rphylopars and PhyloNetworks.

my 2 cents…
Cecile

---
Cécile Ané, Professor (she/her)
H. I. Romnes Faculty Fellow
Departments of Statistics and of Botany
University of Wisconsin - Madison
http://www.stat.wisc.edu/~ane/

CALS statistical consulting lab:
https://calslab.cals.wisc.edu/stat-consulting/



On Jun 29, 2021, at 5:37 PM, neovenatori...@gmail.com wrote:

Dear All,

So this is the main problem I'm facing (see attached figure, which should
be small enough to post). When I calculate the best-fit line under a
Brownian model, this produces a best-fit line that more or less bypasses
the distribution of the data altogether. I did some testing and found that
this result was driven solely by the presence of Monotremata, resulting in
the model heavily downweighting all of the phylogenetic variation within
Theria in favor of the deep divergence between Monotremata and Theria.
Excluding Monotremata produces a PGLS fit that's comparable enough to the
OLS and OU model fit to be justifiable (though I can't just throw out
Monotremata for the sake of throwing it out).

I am planning to do a more theoretical investigation into the effect of
Monotremata on the PGLS fit in a future study, but right now what I am
trying to do is perform a study in which I use this data to construct a
regression model that can be used to predict new data. Which is why I am
trying to use AIC to potentially justify going with OLS or an OU model over
a Brownian model. From a practical perspective the Brownian model is almost
unusable because it produces systematically biased estimates with high
error rates when applied to new data (error rate is roughly double that of
both the OLS and OU model). This is especially the case because the data
must be back-transformed into an arithmetic scale to be useable, and thus a
seemingly minor difference in regression models results in a massive
difference in predicted values. However, I need some objective test to show
that OLS fits the data better than the Brownian model, hence why I was
going with AIC. Overall, OLS does seem to outperform the Brownian model on
average, but the variation in AIC is so high it is hard to interpret this.

This is kind of why I am leery of assuming a null Brownian model. A
Brownian model, if anything, does not seem to accurately model the
relationship between variables.
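
One way to make the tree-to-tree variation in AIC easier to read is to
report the per-tree difference between the two models rather than the raw
values. The sketch below is only an illustration: the log-likelihoods are
made up, the parameter count of 3 per model (intercept, slope, variance) is
an assumption, and the variable names are hypothetical. It uses the
standard definition AIC = 2k - 2 ln(L).

```python
from statistics import median

def aic(log_lik, n_params):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_lik

# Hypothetical log-likelihoods. OLS ignores the tree, so its log-likelihood
# is the same for every tree; the BM fit changes from tree to tree.
LOGLIK_OLS = -12.0
loglik_bm_by_tree = [-15.5, -11.0, -14.2]
K_OLS = K_BM = 3  # assumed: intercept, slope, residual variance

# Positive delta means OLS has the lower (better) AIC on that tree.
delta = [aic(ll_bm, K_BM) - aic(LOGLIK_OLS, K_OLS) for ll_bm in loglik_bm_by_tree]
frac_ols_better = sum(d > 0 for d in delta) / len(delta)

print("delta AIC (BM minus OLS) per tree:", delta)
print("median delta AIC:", median(delta))
print("share of trees where OLS has the lower AIC:", frac_ols_better)
```

Summarising the differences this way (median, range, share of trees)
keeps the comparison tied to each tree instead of pooling AIC values that
are not comparable across trees.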

This is why I am having trouble figuring out how to do model selection.
Just going with accuracy statistics like percent error or standard error of
the estimate, OLS is better from a purely practical sense (it doesn't work
for the monotreme taxa, but it turns out that estimate error in the
monotremes is only decreased by 10% in a Brownian model when it
overestimates mass by nearly 75%, so the improvement really isn't worth it
and using this for monotremes isn't recommended in the first place), but
the reviewers are expressing skepticism over the fact that the Brownian
model produces less useable results. And I'm not entirely sure of the best
way to go about the PGLS if using one of the birth-death trees isn't ideal;
perhaps what Dr. Upham says about using the DNA tree might work better.

Ironically, an OU model might be argued to better fit the data, despite
the concerns that Dr. Bapst mentioned. Looking at the distribution of
signal, even though signal is not random, it is more accurately described
as most taxa hewing to a stable equilibrium with rapid, high magnitude
shifts at certain evolutionary nodes, rather than the covariation between
the two traits evolving in a Brownian fashion. I did some experiments with
a PSR curve and the results seem to favor an OU model or other models with
uneven rates of evolution rather than a pure Brownian model.

Of course, the broader issue I am facing is trying to deal with PGLS
succinctly; the scope of the study isn't necessarily an in-depth comparison
between different regression models, it's more looking at how this variable
correlates with body mass for practical purposes (for which considering
phylogeny is one part of that). It's definitely something to consider but I
am trying to avoid manuscript bloat.

Sincerely,
Russell


         [[alternative HTML version deleted]]

_______________________________________________
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at
http://www.mail-archive.com/r-sig-phylo@r-project.org/


--
Simone Blomberg, BSc (Hons), PhD, MAppStat (she/her)
Senior Lecturer and Consultant Statistician
School of Biological Sciences
The University of Queensland
St. Lucia Queensland 4072
Australia
T: +61 7 3365 2506
email: S.Blomberg1_at_uq.edu.au
Twitter: @simoneb66
UQ ALLY Supporting the diversity of sexuality and gender at UQ.

Policies:
1.  I will NOT analyse your data for you.
2.  Your deadline is your problem.

Basically, I'm not interested in doing research
and I never have been. I'm interested in
understanding, which is quite a different thing.
And often to understand something you have to
work it out for yourself because no one else
has done it. - David Blackwell
