Re: [R-sig-phylo] Model Selection and PGLS

Chris Organ Fri, 02 Jul 2021 09:39:51 -0700

Hi Russell,

And for a fully Bayesian PGLS model, see this:
https://pubmed.ncbi.nlm.nih.gov/17344851/


Best, Chris

On Thu, Jul 1, 2021 at 1:10 AM Theodore Garland
<theodore.garl...@ucr.edu> wrote:
>
> Russell,
> Please read this paper:
> https://pubmed.ncbi.nlm.nih.gov/10718731/
> Cheers
> Ted
>
>
> On Wed, Jun 30, 2021, 9:21 PM Russell Engelman <neovenatori...@gmail.com>
> wrote:
>
> > Dear All,
> >
> >> What you see is the large uncertainty in “ancestral” states, which is part
> >> of the intercept here. The linear relationship that you overlaid on top of
> >> your data is the relationship predicted at the root of the tree (as if such
> >> a thing existed!). There is a lot of uncertainty about the intercept, but
> >> much less uncertainty in the slope. It looks like the slope is not affected
> >> by the inclusion or exclusion of monotremes. (for one possible reference on
> >> the greater precision in the slope versus the intercept, there’s this:
> >> http://dx.doi.org/10.1214/13-AOS1105 for the BM).
> >
> >
> > Yes, that sounds right from the other data I have. The line approximates
> > what would be expected for the root of Mammalia, and the signal in the PGLS
> > is more due to shifts in the y-intercept than shifts in slope, which in
> > turn is supported by the anatomy of the proxy.
> >
> >> My second cent is that the phylogenetic predictions should be stable. The
> >> uncertainty in the intercept (and the large effect of including monotremes
> >> on the intercept) should not affect predictions, so long as you know for
> >> which species you want to make a prediction. If you want to make a prediction
> >> for a species in a small clade “far” from monotremes, say, then the
> >> prediction is probably quite stable, even if you include monotremes: this
> >> is because the phylogenetic prediction should use the phylogenetic
> >> relationships for the species to be predicted. A prediction that uses the
> >> linear relationship at the root and ignores the placement of the species
> >> would be the worst-case scenario: for a mammal species with a completely
> >> unknown placement within mammals.
> >
> >
> > This is what I'm a bit confused about. I was always told (and some of the
> > PGLS literature I have read, such as Rohlf 2011 and Smaers and Rohlf 2016,
> > seems to imply) that it isn't possible to incorporate phylogenetic data
> > from the new data points into the prediction in order to improve it. I'm
> > not sure whether it's possible or not (see below).
> >
> >> There are probably a number of software packages that do phylogenetic
> >> prediction. I know of Rphylopars and PhyloNetworks.
> >
> >
> > I will take a look into those.
> >
> >> I think that Cécile's and Theodore's point is important and too often
> >> overlooked. Using GLS models, the BLUP (Best Linear Unbiased Prediction) is
> >> not simply obtained from the fitted line but should incorporate
> >> information from the model (here, the evolutionary model).
> >
> >
> >      There’s a way to incorporate phylogenetic signal back into a PGLS
> > model? I am super surprised at that. I’ve talked to at least three different
> > colleagues who use PGLS about this issue, and all of them told me that
> > there is no way to incorporate phylogenetic signal into the model for new
> > data points and that I should just go with the single regression line the
> > model gives me (i.e., the regression line for the ancestral node).
> >
> >      I tried looking around to see what previous researchers did when
> > using PCM on body mass (Esteban-Trivigno and Köhler 2011, Campione and
> > Evans 2012, Yapuncich 2017 thesis), and it looks like all of them just went
> > with the best-fit line for the ancestral node, i.e., looking at their
> > reported results, they give a simple trait~predictor equation that does not
> > include phylogeny when calculating new data. Campione and Evans 2012 used
> > PIC rather than PGLS, which I know are technically equivalent, but it doesn't
> > seem like they included phylogenetic information when they predicted new
> > data: they used their equations on dinosaurs but there are no dinosaurs in
> > the tree they used. I know that it’s possible to incorporate phylogenetic
> > signal into the new data using PVR, but PVR has been criticized for other
> > reasons.
> >
> >     This is something that seems really, really concerning because if
> > there is a method of using phylogenetic covariance to adjust the position
> > of new data points it seems like a lot of workers don’t know these methods
> > exist, to the point that even published papers overlook it. This was
> > something I was hoping to highlight in a later paper on the data, but it
> > sounds like people might have discussed it already. I remember talking with
> > my colleagues a lot about "isn't there some way to incorporate phylogenetic
> > information back into the model to improve accuracy of the prediction if we
> > know where the taxon is positioned?" and they just thought there wasn't a
> > way.
> >
> >> Regarding the model comparison, I would simply avoid it (or limit it) by
> >> fitting models flexible enough to accommodate both your BM and OLS cases
> >> and summarize the results obtained across all the trees…
> >
> >
> > I am not entirely sure what is meant here. Do you mean fitting both an OLS
> > and BM model and comparing the two? I am reporting both, but my concern
> > is which of the two is the best one to use going forward, since
> > the BM model is seemingly less accurate (though I am just taking the fitted
> > values from the PGLS model, which I don't think include phylogenetic
> > information). The two models I use produce dramatically different results,
> > for example the BM model produces body mass estimates which are 25% larger
> > than OLS.
> >
> > Right now PGLS is something I would avoid if I had the option (if for no
> > other reason than to avoid putting all of the analyses in a single,
> > overloaded manuscript [the manuscript is already about 90 pages] and
> > deviating from the scope of the study), but I'm sure you know that most
> > regression analyses nowadays require some sort of preliminary PCM to be
> > acceptable.
> >
> > Sincerely,
> > Russell
> >
> > On Wed, Jun 30, 2021 at 10:24 AM Julien Clavel <julien.cla...@hotmail.fr>
> > wrote:
> >
> >> I think that Cécile's and Theodore's point is important and too often
> >> overlooked. Using GLS models, the BLUP (Best Linear Unbiased Prediction) is
> >> not simply obtained from the fitted line but should incorporate
> >> information from the model (here, the evolutionary model).
> >>
> >> For multivariate linear models you can also do it by specifying a tree
> >> including both the species used to build the model and the ones you want to
> >> predict using the “predict” function in mvMORPH (I think that Rphylopars
> >> can deal with multivariate phylogenetic regression too).
> >>
> >> Regarding the model comparison, I would simply avoid it (or limit it) by
> >> fitting models flexible enough to accommodate both your BM and OLS cases
> >> and summarize the results obtained across all the trees…
> >>
> >> Julien
> >>
> >>
> >> From: R-sig-phylo <r-sig-phylo-boun...@r-project.org> on behalf of
> >> Theodore Garland <theodore.garl...@ucr.edu>
> >> Sent: Wednesday, 30 June 2021 03:26
> >> To: Cecile Ane <cecile....@wisc.edu>
> >> Cc: mailman, r-sig-phylo <r-sig-phylo@r-project.org>;
> >> neovenatori...@gmail.com <neovenatori...@gmail.com>
> >> Subject: Re: [R-sig-phylo] Model Selection and PGLS
> >>
> >> All true.  I would just add two things.  First, always graph your data and
> >> do ordinary OLS analyses as a reality check.
> >>
> >> Second, I think this is the original paper for phylogenetic prediction:
> >> Garland, Jr., T., and A. R. Ives. 2000. Using the past to predict the
> >> present: confidence intervals for regression equations in phylogenetic
> >> comparative methods. American Naturalist 155:346–364.
> >> There, we talk about the Equivalency of the Independent-Contrasts and
> >> Generalized Least Squares Approaches.
> >>
> >> Cheers,
> >> Ted
> >>
> >>
> >> On Tue, Jun 29, 2021 at 5:01 PM Cecile Ane <cecile....@wisc.edu> wrote:
> >>
> >> > Hi Russell,
> >> >
> >> > What you see is the large uncertainty in “ancestral” states, which is
> >> > part of the intercept here. The linear relationship that you overlaid on
> >> > top of your data is the relationship predicted at the root of the tree
> >> > (as if such a thing existed!). There is a lot of uncertainty about the
> >> > intercept, but much less uncertainty in the slope. It looks like the
> >> > slope is not affected by the inclusion or exclusion of monotremes. (For
> >> > one possible reference on the greater precision in the slope versus the
> >> > intercept, there’s this: http://dx.doi.org/10.1214/13-AOS1105 for the
> >> > BM.)
> >> >
> >> > My second cent is that the phylogenetic predictions should be stable.
> >> > The uncertainty in the intercept (and the large effect of including
> >> > monotremes on the intercept) should not affect predictions, so long as
> >> > you know for which species you want to make a prediction. If you want to
> >> > make a prediction for a species in a small clade “far” from monotremes,
> >> > say, then the prediction is probably quite stable, even if you include
> >> > monotremes: this is because the phylogenetic prediction should use the
> >> > phylogenetic relationships for the species to be predicted. A prediction
> >> > that uses the linear relationship at the root and ignores the placement
> >> > of the species would be the worst-case scenario: for a mammal species
> >> > with a completely unknown placement within mammals.
> >> >
> >> > There are probably a number of software packages that do phylogenetic
> >> > prediction. I know of Rphylopars and PhyloNetworks.
> >> >
> >> > my 2 cents…
> >> > Cecile
> >> >
> >> > ---
> >> > Cécile Ané, Professor (she/her)
> >> > H. I. Romnes Faculty Fellow
> >> > Departments of Statistics and of Botany
> >> > University of Wisconsin - Madison
> >> > http://www.stat.wisc.edu/~ane/
> >> >
> >> > CALS statistical consulting lab:
> >> > https://calslab.cals.wisc.edu/stat-consulting/
> >> >
> >> >
> >> >
> >> > On Jun 29, 2021, at 5:37 PM, neovenatori...@gmail.com wrote:
> >> >
> >> > Dear All,
> >> >
> >> > So this is the main problem I'm facing (see attached figure, which
> >> > should be small enough to post). When I calculate the best-fit line
> >> > under a Brownian model, this produces a best-fit line that more or less
> >> > bypasses the distribution of the data altogether. I did some testing and
> >> > found that this result was driven solely by the presence of Monotremata,
> >> > resulting in the model heavily downweighting all of the phylogenetic
> >> > variation within Theria in favor of the deep divergence between
> >> > Monotremata and Theria. Excluding Monotremata produces a PGLS fit that's
> >> > comparable enough to the OLS and OU model fit to be justifiable (though
> >> > I can't just throw out Monotremata for the sake of throwing it out).
> >> >
> >> > I am planning to do a more theoretical investigation into the effect of
> >> > Monotremata on the PGLS fit in a future study, but right now what I am
> >> > trying to do is perform a study in which I use this data to construct a
> >> > regression model that can be used to predict new data, which is why I am
> >> > trying to use AIC to potentially justify going with OLS or an OU model
> >> > over a Brownian model. From a practical perspective the Brownian model
> >> > is almost unusable because it produces systematically biased estimates
> >> > with high error rates when applied to new data (error rate is roughly
> >> > double that of both the OLS and OU model). This is especially the case
> >> > because the data must be back-transformed into an arithmetic scale to be
> >> > useable, and thus a seemingly minor difference in regression models
> >> > results in a massive difference in predicted values. However, I need
> >> > some objective test to show that OLS fits the data better than the
> >> > Brownian model, hence why I was going with AIC. Overall, OLS does seem
> >> > to outperform the Brownian model on average, but the variation in AIC is
> >> > so high it is hard to interpret this.
> >> >
> >> > This is kind of why I am leery of assuming a null Brownian model. A
> >> > Brownian model, if anything, does not seem to accurately model the
> >> > relationship between variables.
> >> >
> >> > This is why I am having trouble figuring out how to do model selection.
> >> > Going by accuracy statistics like percent error or standard error of
> >> > the estimate, OLS is better in a purely practical sense (it doesn't work
> >> > for the monotreme taxa, but it turns out that estimate error in the
> >> > monotremes is only decreased by 10% in a Brownian model when it
> >> > overestimates mass by nearly 75%, so the improvement really isn't worth
> >> > it and using this for monotremes isn't recommended in the first place),
> >> > but the reviewers are expressing skepticism over the fact that the
> >> > Brownian model produces less useable results. And I'm not entirely sure
> >> > of the best way to go about the PGLS if using one of the birth-death
> >> > trees isn't ideal; perhaps what Dr. Upham says about using the DNA tree
> >> > might work better.
> >> >
> >> > Ironically, an OU model might be argued to better fit the data, despite
> >> > the concerns that Dr. Bapst mentioned. Looking at the distribution of
> >> > signal, even though the signal is not random, it is more accurately
> >> > described as most taxa hewing to a stable equilibrium with rapid,
> >> > high-magnitude shifts at certain evolutionary nodes, rather than the
> >> > covariation between the two traits evolving in a Brownian fashion. I did
> >> > some experiments with a PSR curve and the results seem to favor an OU
> >> > model or other models with uneven rates of evolution rather than a pure
> >> > Brownian model.
> >> >
> >> > Of course, the broader issue I am facing is trying to deal with PGLS
> >> > succinctly; the scope of the study isn't necessarily an in-depth
> >> > comparison between different regression models, it's more looking at how
> >> > this variable correlates with body mass for practical purposes (of which
> >> > considering phylogeny is one part). It's definitely something to
> >> > consider but I am trying to avoid manuscript bloat.
> >> >
> >> > Sincerely,
> >> > Russell
> >> >
> >> >
> >> >         [[alternative HTML version deleted]]
> >> >
> >> > _______________________________________________
> >> > R-sig-phylo mailing list - R-sig-phylo@r-project.org
> >> > https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
> >> > Searchable archive at
> >> > http://www.mail-archive.com/r-sig-phylo@r-project.org/
> >> >

_______________________________________________
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/
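
The phylogenetic prediction (BLUP) discussed in this thread can be illustrated with a small sketch. It is not taken from any of the packages mentioned (Rphylopars, PhyloNetworks, mvMORPH); the function names, the four-species tree, and the data are all made up for illustration. It assumes a Brownian-motion model, so the covariance between two species is the branch length they share from the root. The prediction for a new species is the root regression line plus a correction c' V^-1 (y - X b), where c holds the new species' covariances with the species used to fit the model.

```python
def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination; a and b are lists of rows."""
    n = len(a)
    m = [list(a[i]) + list(b[i]) for i in range(n)]  # augmented matrix [a | b]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [row[n:] for row in m]


def transpose(a):
    return [list(r) for r in zip(*a)]


def matmul(a, b):
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def pgls_fit(V, x, y):
    """GLS estimate of [intercept, slope] given phylogenetic covariance V."""
    X = [[1.0, xi] for xi in x]
    Y = [[yi] for yi in y]
    Xt = transpose(X)
    # beta = (X' V^-1 X)^-1 X' V^-1 y, computed without forming V^-1 explicitly.
    beta = solve(matmul(Xt, solve(V, X)), matmul(Xt, solve(V, Y)))
    return [row[0] for row in beta]


def blup_predict(V, x, y, beta, c_new, x_new):
    """Return (root-line prediction, BLUP) for a new species.

    c_new[i] is the branch length the new species shares with species i.
    """
    resid = [[yi - (beta[0] + beta[1] * xi)] for xi, yi in zip(x, y)]
    v_inv_resid = solve(V, resid)
    root_line = beta[0] + beta[1] * x_new
    correction = sum(ci * r[0] for ci, r in zip(c_new, v_inv_resid))
    return root_line, root_line + correction


# Hypothetical tree ((A:1,B:1):1,(C:1,D:1):1): tips are 2 units from the root,
# and species in the same clade share 1 unit of branch length.
V = [[2.0, 1.0, 0.0, 0.0],
     [1.0, 2.0, 0.0, 0.0],
     [0.0, 0.0, 2.0, 1.0],
     [0.0, 0.0, 1.0, 2.0]]
x = [1.0, 2.0, 1.0, 2.0]   # made-up predictor values for A, B, C, D
y = [3.0, 4.0, 1.0, 2.0]   # made-up responses: same slope, clade A+B shifted up

# Hypothetical new species attached at the ancestor of A and B: it shares
# 1 unit of branch length with A and B and none with C and D.
c_new = [1.0, 1.0, 0.0, 0.0]

beta = pgls_fit(V, x, y)
root_line, blup = blup_predict(V, x, y, beta, c_new, x_new=1.5)
print("intercept, slope:", beta)
print("root-line prediction:", root_line)
print("phylogenetic (BLUP) prediction:", blup)
```

In this toy example the fitted line sits between the two clades, and the BLUP for the new species is pulled away from that line toward its relatives A and B. A species with unknown placement (c_new all zeros) would fall back to the root line, the worst-case scenario Cécile describes.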
