Re: [R-sig-phylo] Model Selection and PGLS

Theodore Garland Wed, 30 Jun 2021 22:10:08 -0700

Russell,
Please read this paper:
https://pubmed.ncbi.nlm.nih.gov/10718731/
Cheers
Ted



On Wed, Jun 30, 2021, 9:21 PM Russell Engelman <neovenatori...@gmail.com>
wrote:

> Dear All,
>
>> What you see is the large uncertainty in “ancestral” states, which is part
>> of the intercept here. The linear relationship that you overlaid on top of
>> your data is the relationship predicted at the root of the tree (as if such
>> a thing existed!). There is a lot of uncertainty about the intercept, but
>> much less uncertainty in the slope. It looks like the slope is not affected
>> by the inclusion or exclusion of monotremes. (for one possible reference on
>> the greater precision in the slope versus the intercept, there’s this:
>> http://dx.doi.org/10.1214/13-AOS1105 for the BM).
>
>
> Yes, that sounds right from the other data I have. The line approximates
> what would be expected for the root of Mammalia, and the signal in the PGLS
> is more due to shifts in the y-intercept than shifts in slope, which in
> turn is supported by the anatomy of the proxy.
>
>> My second cent is that the phylogenetic predictions should be stable. The
>> uncertainty in the intercept (and the large effect of including monotremes
>> on the intercept) should not affect predictions, so long as you know for
>> which species you want to make a prediction. If you want to make a
>> prediction for a species in a small clade “far” from monotremes, say, then
>> the prediction is probably quite stable, even if you include monotremes:
>> this is because the phylogenetic prediction should use the phylogenetic
>> relationships for the species to be predicted. A prediction that uses the
>> linear relationship at the root and ignores the placement of the species
>> would be the worst-case scenario: for a mammal species with a completely
>> unknown placement within mammals.
>
>
> This is what I'm a bit confused about. I was always told (and it seemingly
> implies this in some of the PGLS literature I read like Rohlf 2011 and
> Smaers and Rohlf 2016) that it isn't possible to include phylogenetic data
> from the new data points into the prediction in order to improve
> predictions. I'm a little confused as to whether it's possible or not (see
> below).
>
>> There are probably a number of software packages that do phylogenetic
>> prediction. I know of Rphylopars and PhyloNetworks.
>
>
> I will take a look into those.
>
>> I think that Cécile's and Theodore's point is important and too often
>> overlooked. Using GLS models, the BLUP (Best Linear Unbiased Prediction) is
>> not simply obtained from the fitted line but should incorporate
>> information from the (evolutionary here) model.
>
>
>      There’s a way to impute phylogenetic signal back into a PGLS model? I
> am super surprised at that. I’ve talked to at least three different
> colleagues who use PGLS about this issue, and all of them had told me that
> there is no way to input phylogenetic signal back into the model for new
> data points and I should just go with the single regression line the model
> gives me (i.e., the regression line for the ancestral node).
>
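The difference between reading a value off the fitted line and making a phylogenetic prediction can be sketched in a few lines. This is only an illustration under stated assumptions, not the method of any package named in this thread: the four-species tree, its branch lengths, and all trait values are invented, and the function names (solve, gls_fit, blup) are hypothetical. It assumes Brownian motion, so the covariance between two species is proportional to the branch length they share from the root, and it uses the standard GLS predictor, which adds the correction c' C^-1 (y - X b) to the value given by the line.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gls_fit(C, X, y):
    """GLS coefficients b solving (X' C^-1 X) b = X' C^-1 y."""
    p = len(X[0])
    cols = [[row[j] for row in X] for j in range(p)]
    cinv_cols = [solve(C, col) for col in cols]  # C^-1 X, one column at a time
    cinv_y = solve(C, y)
    A = [[dot(cols[i], cinv_cols[j]) for j in range(p)] for i in range(p)]
    rhs = [dot(cols[i], cinv_y) for i in range(p)]
    return solve(A, rhs)


def blup(C, X, y, x_new, c_new):
    """Return (line-only prediction, phylogenetically corrected prediction).

    c_new holds the branch length the new species shares with each
    species used to fit the model.
    """
    beta = gls_fit(C, X, y)
    resid = [yi - dot(xi, beta) for xi, yi in zip(X, y)]
    line_only = dot(x_new, beta)
    correction = dot(c_new, solve(C, resid))
    return line_only, line_only + correction


# Invented ultrametric tree of height 1: ((A,B),C) plus a deep outgroup D.
C = [
    [1.0, 0.8, 0.4, 0.0],
    [0.8, 1.0, 0.4, 0.0],
    [0.4, 0.4, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
# Invented log-scale data: intercept column plus one predictor.
X = [[1.0, 1.0], [1.0, 1.2], [1.0, 2.0], [1.0, 0.5]]
y = [2.1, 2.4, 3.1, 2.9]

# Hypothetical new species, sister to A (shares 0.9 with A, 0.8 with B).
x_new = [1.0, 1.1]
c_new = [0.9, 0.8, 0.4, 0.0]

line_only, corrected = blup(C, X, y, x_new, c_new)
print("line at the root:", line_only)
print("with phylogenetic correction:", corrected)
```

In this sketch a new species that shares no history with the fitted species (c_new all zeros) gets exactly the line-only value, which matches the worst-case scenario described above, while a species placed close to A is pulled toward A's residual.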
>      I tried looking around to see what previous researchers used when
> using PCM on body mass (Esteban-Trivigno and Köhler 2011, Campione and
> Evans 2012, Yapuncich 2017 thesis) and it looks like all of them just went
> with the best fit line with the ancestral node, i.e., looking at their
> reported results they give a simple trait~predictor equation that does not
> include phylogeny when calculating new data. Campione and Evans 2012 used
> PIC rather than PGLS, which I know are technically equivalent, but it doesn't
> seem like they included phylogenetic information when they predicted new
> data: they used their equations on dinosaurs but there are no dinosaurs in
> the tree they used. I know that it’s possible to incorporate phylogenetic
> signal into the new data using PVR but PVR has been criticized for other
> reasons.
>
>     This is something that seems really, really concerning because if
> there is a method of using phylogenetic covariance to adjust the position
> of new data points it seems like a lot of workers don’t know these methods
> exist, to the point that even published papers overlook it. This was
> something I was hoping to highlight in a later paper on the data, but it
> sounds like people might have discussed it already. I remember talking with
> my colleagues a lot about "isn't there some way to incorporate phylogenetic
> information back into the model to improve accuracy of the prediction if we
> know where the taxon is positioned?" and they just thought there wasn't a
> way.
>
>> Regarding the model comparison, I would simply avoid it (or limit it) by
>> fitting models flexible enough to accommodate between your BM and OLS case
>> and summarize the results obtained across all the trees…
>
>
> I am not entirely sure what is meant here. Do you mean fitting both an OLS
> and BM model and comparing both models? I am reporting both, but my concern
> is about which model I report is the best one to use going forward, since
> the BM model is seemingly less accurate (though I am just taking the fitted
> values from the PGLS model, which I don't think include phylogenetic
> information). The two models I use produce dramatically different results,
> for example the BM model produces body mass estimates which are 25% larger
> than OLS.
>
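The size of that gap can be related back to the log scale with a line of arithmetic. A minimal sketch, assuming the regressions were fitted on log-transformed mass (the base is not stated in the thread, so both base 10 and natural logs are shown); the helper names are hypothetical.

```python
import math


def log_gap_for_ratio(ratio, base=10.0):
    """Log-scale difference that yields a given ratio after back-transformation."""
    return math.log(ratio, base)


def ratio_for_log_gap(delta_log, base=10.0):
    """Ratio of two back-transformed predictions whose logs differ by delta_log."""
    return base ** delta_log


# A 25% larger estimate on the arithmetic scale, expressed as a log-scale shift.
print("log10 shift:", log_gap_for_ratio(1.25, base=10.0))
print("ln shift:", log_gap_for_ratio(1.25, base=math.e))
```

The shift is small on a log10 axis, which is why two lines that look close on a log-log plot can still give clearly different masses once back-transformed.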
> Right now PGLS is something I would avoid if I had the option (if for no
> other reason than not put all of the analyses in a single, overloaded
> manuscript [the manuscript is already about 90 pages] and deviate from the
> scope of the study), but I'm sure you know that most regression analyses
> nowadays require some sort of preliminary PCM to be acceptable.
>
> Sincerely,
> Russell
>
> On Wed, Jun 30, 2021 at 10:24 AM Julien Clavel <julien.cla...@hotmail.fr>
> wrote:
>
>> I think that Cécile's and Theodore's point is important and too often
>> overlooked. Using GLS models, the BLUP (Best Linear Unbiased Prediction) is
>> not simply obtained from the fitted line but should incorporate
>> information from the (evolutionary here) model.
>>
>> For multivariate linear model you can also do it by specifying a tree
>> including both the species used to build the model and the ones you want to
>> predict using the “predict” function in mvMORPH (I think that Rphylopars
>> can deal with multivariate phylogenetic regression too).
>>
>> Regarding the model comparison, I would simply avoid it (or limit it) by
>> fitting models flexible enough to accommodate between your BM and OLS case
>> and summarize the results obtained across all the trees…
>>
>> Julien
>>
>>
>> From: R-sig-phylo <r-sig-phylo-boun...@r-project.org> on behalf of
>> Theodore Garland <theodore.garl...@ucr.edu>
>> Sent: Wednesday, 30 June 2021 03:26
>> To: Cecile Ane <cecile....@wisc.edu>
>> Cc: mailman, r-sig-phylo <r-sig-phylo@r-project.org>;
>> neovenatori...@gmail.com <neovenatori...@gmail.com>
>> Subject: Re: [R-sig-phylo] Model Selection and PGLS
>>
>> All true.  I would just add two things.  First, always graph your data and
>> do ordinary OLS analyses as a reality check.
>>
>> Second, I think this is the original paper for phylogenetic prediction:
>> Garland, Jr., T., and A. R. Ives. 2000. Using the past to predict the
>> present: confidence intervals for regression equations in phylogenetic
>> comparative methods. American Naturalist 155:346–364.
>> There, we talk about the Equivalency of the Independent-Contrasts and
>> Generalized Least Squares Approaches.
>>
>> Cheers,
>> Ted
>>
>>
>> On Tue, Jun 29, 2021 at 5:01 PM Cecile Ane <cecile....@wisc.edu> wrote:
>>
>> > Hi Russell,
>> >
>> > What you see is the large uncertainty in “ancestral” states, which is part
>> > of the intercept here. The linear relationship that you overlaid on top of
>> > your data is the relationship predicted at the root of the tree (as if such
>> > a thing existed!). There is a lot of uncertainty about the intercept, but
>> > much less uncertainty in the slope. It looks like the slope is not affected
>> > by the inclusion or exclusion of monotremes. (for one possible reference on
>> > the greater precision in the slope versus the intercept, there’s this:
>> > http://dx.doi.org/10.1214/13-AOS1105 for the BM).
>> >
>> > My second cent is that the phylogenetic predictions should be stable. The
>> > uncertainty in the intercept (and the large effect of including monotremes
>> > on the intercept) should not affect predictions, so long as you know for
>> > which species you want to make a prediction. If you want to make a
>> > prediction for a species in a small clade “far” from monotremes, say, then
>> > the prediction is probably quite stable, even if you include monotremes:
>> > this is because the phylogenetic prediction should use the phylogenetic
>> > relationships for the species to be predicted. A prediction that uses the
>> > linear relationship at the root and ignores the placement of the species
>> > would be the worst-case scenario: for a mammal species with a completely
>> > unknown placement within mammals.
>> >
>> > There are probably a number of software packages that do phylogenetic
>> > prediction. I know of Rphylopars and PhyloNetworks.
>> >
>> > my 2 cents…
>> > Cecile
>> >
>> > ---
>> > Cécile Ané, Professor (she/her)
>> > H. I. Romnes Faculty Fellow
>> > Departments of Statistics and of Botany
>> > University of Wisconsin - Madison
>> > http://www.stat.wisc.edu/~ane/
>> >
>> > CALS statistical consulting lab:
>> > https://calslab.cals.wisc.edu/stat-consulting/
>> >
>> >
>> >
>> > On Jun 29, 2021, at 5:37 PM, neovenatori...@gmail.com wrote:
>> >
>> > Dear All,
>> >
>> > So this is the main problem I'm facing (see attached figure, which should
>> > be small enough to post). When I calculate the best-fit line under a
>> > Brownian model, this produces a best-fit line that more or less bypasses
>> > the distribution of the data altogether. I did some testing and found that
>> > this result was driven solely by the presence of Monotremata, resulting in
>> > the model heavily downweighting all of the phylogenetic variation within
>> > Theria in favor of the deep divergence between Monotremata and Theria.
>> > Excluding Monotremata produces a PGLS fit that's comparable enough to the
>> > OLS and OU model fit to be justifiable (though I can't just throw out
>> > Monotremata for the sake of throwing it out).
>> >
>> > I am planning to do a more theoretical investigation into the effect of
>> > Monotremata on the PGLS fit in a future study, but right now what I am
>> > trying to do is perform a study in which I use this data to construct a
>> > regression model that can be used to predict new data. Which is why I am
>> > trying to use AIC to potentially justify going with OLS or an OU model over
>> > a Brownian model. From a practical perspective the Brownian model is almost
>> > unusable because it produces systematically biased estimates with high
>> > error rates when applied to new data (error rate is roughly double that of
>> > both the OLS and OU model). This is especially the case because the data
>> > must be back-transformed into an arithmetic scale to be useable, and thus a
>> > seemingly minor difference in regression models results in a massive
>> > difference in predicted values. However, I need some objective test to show
>> > that OLS fits the data better than the Brownian model, hence why I was
>> > going with AIC. Overall, OLS does seem to outperform the Brownian model on
>> > average, but the variation in AIC is so high it is hard to interpret this.
>> >
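The AIC comparison across a set of trees can be summarized rather than judged tree by tree. A minimal sketch, with loud assumptions: the log-likelihoods below are invented, the function names are hypothetical, and each model is counted as having three parameters (intercept, slope, residual variance). AIC is computed as 2k - 2 ln L, and the OLS log-likelihood is repeated because it does not depend on the tree.

```python
import statistics


def aic(log_lik, k):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * k - 2 * log_lik


def summarize_delta_aic(loglik_a, loglik_b, k_a, k_b):
    """Summarize AIC(a) - AIC(b) across trees; negative values favor model a."""
    deltas = [aic(la, k_a) - aic(lb, k_b) for la, lb in zip(loglik_a, loglik_b)]
    return {
        "median_delta": statistics.median(deltas),
        "min_delta": min(deltas),
        "max_delta": max(deltas),
        "share_a_preferred": sum(d < 0 for d in deltas) / len(deltas),
    }


# Invented values for five candidate trees.
ols_loglik = [-12.0] * 5
bm_loglik = [-15.2, -11.1, -13.4, -18.0, -12.5]

summary = summarize_delta_aic(ols_loglik, bm_loglik, k_a=3, k_b=3)
print(summary)
```

Reporting the median and range of the AIC difference, together with the share of trees that favor each model, conveys the tree-to-tree variation more directly than a single averaged value.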
>> > This is kind of why I am leery of assuming a null Brownian model. A
>> > Brownian model, if anything, does not seem to accurately model the
>> > relationship between variables.
>> >
>> > This is why I am having trouble figuring out how to do model selection.
>> > Going by accuracy statistics like percent error or standard error of the
>> > estimate, OLS is better from a purely practical sense (it doesn't work
>> > for the monotreme taxa, but it turns out that estimate error in the
>> > monotremes is only decreased by 10% in a Brownian model when it
>> > overestimates mass by nearly 75%, so the improvement really isn't worth it
>> > and using this for monotremes isn't recommended in the first place), but
>> > the reviewers are expressing skepticism over the fact that the Brownian
>> > model produces less useable results. And I'm not entirely sure of the best
>> > way to go about the PGLS if using one of the birth-death trees isn't ideal;
>> > perhaps what Dr. Upham says about using the DNA tree might work better.
>> >
>> > Ironically, an OU model might be argued to better fit the data, despite
>> > the concerns that Dr. Bapst mentioned. Looking at the distribution of
>> > signal, even though signal is not random, it is more accurately described
>> > as most taxa hewing to a stable equilibrium with rapid, high magnitude
>> > shifts at certain evolutionary nodes, rather than the covariation between
>> > the two traits evolving in a Brownian fashion. I did some experiments with
>> > a PSR curve and the results seem to favor an OU model or other models with
>> > uneven rates of evolution rather than a pure Brownian model.
>> >
>> > Of course, the broader issue I am facing is trying to deal with PGLS
>> > succinctly; the scope of the study isn't necessarily an in-depth comparison
>> > between different regression models, it's more looking at how this variable
>> > correlates with body mass for practical purposes (for which considering
>> > phylogeny is one part of that). It's definitely something to consider but I
>> > am trying to avoid manuscript bloat.
>> >
>> > Sincerely,
>> > Russell
>> >
>> >
>> >         [[alternative HTML version deleted]]
>> >
>> > _______________________________________________
>> > R-sig-phylo mailing list - R-sig-phylo@r-project.org
>> > https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
>> > Searchable archive at
>> > http://www.mail-archive.com/r-sig-phylo@r-project.org/
>> >
>>
>
>

