Re: [R-sig-phylo] Model Selection and PGLS

2021-07-08 Thread Cecile Ane
Alternatively, you could use Rphylopars 
(https://github.com/ericgoolsby/Rphylopars), so you don’t need to code 
predictions by hand.
I included a reproducible example below, using a single variable. You could use 
multiple variables to take advantage of the correlation between variables for 
better predictions.
Cecile.

input tree and data (20 taxa):

tree <- 
read.tree(text="(((t2:1.07,(((t19:0.06,t20:0.06):0.25,t10:0.31):0.71,t3:1.01):0.05):0.03,(t17:0.13,t18:0.13):0.45,(t13:0.26,t14:0.26):0.32):0.24,((t15:0.19,t16:0.19):0.12,t9:0.31):0.52):0.17,(t6:0.53,t7:0.53):0.46):0.09,t1:1.08):0.02):0.29,(((t8:0.53,(t11:0.29,t12:0.29):0.24):0.01,t5:0.54):0.45,t4:0.99):0.4);")
dat <- data.frame(
  species = 
c("t2","t19","t20","t10","t3","t17","t18","t13","t14","t15","t16","t9","t6","t7","t1","t8","t11","t12","t5","t4"),
  trait = 
c(1.93,0.598,1.559,-0.775,-2.445,2.109,NA,3.465,2.161,-1.474,0.274,-0.026,4.419,2.9,2.795,-2.005,-3.425,-1.711,-1.686,NA)
)
row.names(dat) = dat$species

Now analysis with a Brownian motion (other models are available):

library(Rphylopars)
p_BM <- phylopars(dat, tree)
p_BM$anc_recon[1:20,] # predicted species means
cbind(dat[,1:2], prediction=p_BM$anc_recon[1:20,]) # to compare predictions 
with observations
p_BM$anc_var[1:20,] # variances of each prediction: 0 when observed
p_BM$anc_recon[1:20,] - sqrt(p_BM$anc_var[1:20,])*1.96 # lower end of 95% CI
p_BM$anc_recon[1:20,] + sqrt(p_BM$anc_var[1:20,])*1.96 # upper end of 95% CI


output showing predictions:

species  trait prediction
t2   t2  1.930  1.930
t19 t19  0.598  0.598
...
t18 t18 NA  2.0566536
t13 t13  3.465  3.465
...

On Jul 7, 2021, at 6:33 PM, 
neovenatori...@gmail.com wrote:

Dear All,

Just sending a message to see if my last email went through. The code for
the PGLS still doesn't seem to work: it's giving results contradictory to
what would be expected based on the data (e.g., mu should be extremely
negative for Monotremata, but it's very variable among the three species
and many other species exhibit higher mus in that direction, adding the
correction factor actually makes the prediction worse despite high Pagel's
lambda). I attached a reprex version of my code with the data that
reproduced the result, but I was concerned it didn't go through because
when I've tried to attach files to the r-sig-phylo list previously they are
often unable to go through. I checked the r-sig-phylo archive to see if it
successfully went through, but it isn't listed there, which is one reason I
am concerned my message did not get sent properly.

Just emailing to ask, sorry if this is a bother.

Sincerely,
Russell


[[alternative HTML version deleted]]

___
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/


Re: [R-sig-phylo] Model Selection and PGLS

2021-07-07 Thread Russell Engelman
Dear All,

Just sending a message to see if my last email went through. The code for
the PGLS still doesn't seem to work: it's giving results contradictory to
what would be expected based on the data (e.g., mu should be extremely
negative for Monotremata, but it's very variable among the three species
and many other species exhibit higher mus in that direction, adding the
correction factor actually makes the prediction worse despite high Pagel's
lambda). I attached a reprex version of my code with the data that
reproduced the result, but I was concerned it didn't go through because
when I've tried to attach files to the r-sig-phylo list previously they are
often unable to go through. I checked the r-sig-phylo archive to see if it
successfully went through, but it isn't listed there, which is one reason I
am concerned my message did not get sent properly.

Just emailing to ask, sorry if this is a bother.

Sincerely,
Russell

On Fri, Jul 2, 2021 at 8:52 PM Simone Blomberg 
wrote:

> I knew there would be a mistake somewhere. Here is the correction:
>
> ## C <- 1+(lambda-1)*(1-mat) ## Lambda correction WRONG!!
> C <- (lambda*mat)+(1-lambda)*diag(diag(mat)) ## correct, I think..
>
> I tested it on a toy example of a tree with 5 species and it works fine..
>
> I hope there are no other mistakes! See other comments below.
>
> Cheers,
>
> Simone.
> On 3/7/21 8:40 am, Russell Engelman wrote:
>
> Dear All,
>
> Okay, so I tried the code that Dr. Blomberg provided, and it doesn't seem
> to work. I rewrote the code slightly in order for it to try and predict the
> body mass of the taxa used to calculate the line, rather than some new
> taxon, because I need to do that in order to calculate prediction error and
> be able to compare it to previous studies. I got a lambda of about 0.9,
> which is approximately the same as what I got previously. So on the surface
> it seems like the method should work.
>
> The problem comes when I try to apply this to the data itself to calculate
> mu. This is the modified code I used if that helps, though I don't think I
> can send the data or tree to make it replicable because the message
> wouldn't go through last time.
>
> tree.trimmed <- drop.tip(trees[[1]], which(is.na(b$bm)))
> b.trimmed <- b[complete.cases(b$bm),]
> fit <- gls(log(bm)~I(log(ocw)^(2/3)), corPagel(0.5, phy=tree.trimmed,
> form=~species),
>data=b.trimmed)
> (lambda <- as.numeric(fit$modelStruct))
> model.matFULL <- model.matrix(~I(log(ocw)^(2/3)), data=b)
> preds <- predict(fit,newdata=b)
> mat <- vcv.phylo(trees[[1]])
> C <- 1+(lambda-1)*(1-mat) ## Lambda correction
> XnoMean <- scale(model.matFULL, scale=FALSE)
> mu <- C %*% solve(C) %*%  XnoMean ## Replaced CCompletedata and cihmat
> with C, since both seem to be subsets of that data. This is the only place
> I could think things might go wrong.
>
> C is the full variance-covariance matrix, made from the tree that includes
> both the known and unknown species. CCompletedata is the
> variance-covariance matrix WITHOUT the unknown species. This corresponds to
> C in Ted and Tony's paper.  cihmat is the matrix of relationships of the
> unknown species to the known species, one row for each unknown species. So
> you cannot replace cihmat with C.
>
> mu2<-(mu[match(b$species,row.names(mu)),])[,2] ## Have to re-sort mu based
> on the data sheet. Mu is reported in phylogenetic order, whereas
> predict.gls will always report the data in alphabetical order of the row
> names even if a phylogeny is present.
>
> predict.gls reports the predicted values in the order that is present in
> the newdata argument. You shouldn't have to re-order anything in my example
> code. YMMV however if you adapt my code.
>
> exp(preds+mu2) ## "correct" prediction accounting for bias due to
> correlations with other species. Exp() is because data were log-transformed.
>
> So when I check the results of this equation the predicted values are
> actually *less *accurate than the fitted values reported from gls without
> correction. Normally the accuracy is about +/- 64% of the actual value, but
> adding the correction here results in the data being +/- 84% of the actual
> value. What's more, I took a look at the values of mu and they don't seem
> to make sense when compared on the phylogeny. Namely, if mu were being
> calculated right, one would expect monotremes to have a very negative value
> compared to all other mammals given the deep divergence in trait states
> between monotremes and all other taxa in the figure I previously attached.
> But when I check the mu values, I get a value of 0.35 for *Zaglossus*,
> -0.58 for *Tachyglossus*, and 0.59 for *Ornithorhynchus*. Which is not
> what would be expected if mu were modelling a phylogenetic correction
> factor (if nothing else, the range of mu should probably not span zero). So
> it suggests to me I'm not doing something right. I know that PGLS doesn't
> always result in better fits than OLS, but I'm surprised adding signal back
> 

Re: [R-sig-phylo] Model Selection and PGLS

2021-07-04 Thread Julien Clavel


I think that Cécile' and Theodore' point is important and too often overlooked. 
Using GLS models, the BLUP (Best Linear Unbiased Prediction) is not simply 
obtained from the fitted line but should incorporates information from the 
(evolutionary here) model.

For multivariate linear model you can also do it by specifying a tree including 
both the species used to build the model and the ones you want to predict using 
the “predict” function in mvMORPH (I think that Rphylopars can deal with 
multivariate phylogenetic regression too).

Regarding the model comparison, I would simply avoid it (or limit it) by 
fitting models flexible enough to accommodate between your BM and OLS case and 
summarize the results obtained across all the trees…

Julien


De : R-sig-phylo  de la part de Theodore 
Garland 
Envoyé : mercredi 30 juin 2021 03:26
À : Cecile Ane 
Cc : mailman, r-sig-phylo ; neovenatori...@gmail.com 

Objet : Re: [R-sig-phylo] Model Selection and PGLS 
 
All true.  I would just add two things.  First, always graph your data and
do ordinary OLS analyses as a reality check.

Second, I think this is the original paper for phylogenetic prediction:
Garland, Jr., T., and A. R. Ives. 2000. Using the past to predict the
present: confidence intervals for regression equations in phylogenetic
comparative methods. American Naturalist 155:346–364.
There, we talk about the Equivalency of the Independent-Contrasts and
Generalized Least Squares Approaches.

Cheers,
Ted


On Tue, Jun 29, 2021 at 5:01 PM Cecile Ane  wrote:

> Hi Russel,
>
> What you see is the large uncertainty in “ancestral” states, which is part
> of the intercept here. The linear relationship that you overlaid on top of
> your data is the relationship predicted at the root of the tree (as if such
> a thing existed!). There is a lot of uncertainty about the intercept, but
> much less uncertainty in the slope. It looks like the slope is not affected
> by the inclusion or exclusion of monotremes. (for one possible reference on
> the greater precision in the slope versus the intercept, there’s this:
> http://dx.doi.org/10.1214/13-AOS1105 for the BM).
>
> My second cent is that the phylogenetic predictions should be stable. The
> uncertainty in the intercept —and the large effect of including monotremes
> on the intercept— should not affect predictions, so long as you know for
> which species you want to make a prediction. If you want to make prediction
> for a species in a small clade “far” from monotremes, say, then the
> prediction is probably quite stable, even if you include monotremes: this
> is because the phylogenetic prediction should use the phylogenetic
> relationships for the species to be predicted. A prediction that uses the
> linear relationship at the root and ignores the placement of the species
> would be the worst-case scenario: for a mammal species with a completely
> unknown placement within mammals.
>
> There’s probably a number of software that do phylogenetic prediction. I
> know of Rphylopars and PhyloNetworks.
>
> my 2 cents…
> Cecile
>
> ---
> Cécile Ané, Professor (she/her)
> H. I. Romnes Faculty Fellow
> Departments of Statistics and of Botany
> University of Wisconsin - Madison
> www.stat.wisc.edu/~ane/<http://www.stat.wisc.edu/~ane/>
>
> CALS statistical consulting lab:
> https://calslab.cals.wisc.edu/stat-consulting/
>
>
>
> On Jun 29, 2021, at 5:37 PM, neovenatori...@gmail.com neovenatori...@gmail.com> wrote:
>
> Dear All,
>
> So this is the main problem I'm facing (see attached figure, which should
> be small enough to post). When I calculate the best-fit line under a
> Brownian model, this produces a best-fit line that more or less bypasses
> the distribution of the data altogether. I did some testing and found that
> this result was driven solely by the presence of Monotremata, resulting in
> the model heavily downweighting all of the phylogenetic variation within
> Theria in favor of the deep divergence between Monotremata and Theria.
> Excluding Monotremata produces a PGLS fit that's comparable enough to the
> OLS and OU model fit to be justifiable (though I can't just throw out
> Monotremata for the sake of throwing it out).
>
> I am planning to do a more theoretical investigation into the effect of
> Monotremata on the PGLS fit in a future study, but right now what I am
> trying to do is perform a study in which I use this data to construct a
> regression model that can be used to predict new data. Which is why I am
> trying to use AIC to potentially justify going with OLS or an OU model over
> a Brownian model. From a practical perspective the Brownian model is almost
> unusable because it produces systematically biased estimates with high
> error rates when applied to new data (error rate is roughly double that of
&

Re: [R-sig-phylo] Model Selection and PGLS

2021-07-04 Thread Simone Blomberg
 discussed it already. I remember talking with
my colleagues a lot about "isn't there some way to incorporate phylogenetic
information back into the model to improve accuracy of the prediction if we
know where the taxon is positioned?" and they just thought there wasn't a
way.

Regarding the model comparison, I would simply avoid it (or limit it) by

fitting models flexible enough to accommodate between your BM and OLS case
and summarize the results obtained across all the trees…


I am not entirely sure what is meant here. Do you mean fitting both an OLS
and BM model and comparing both models? I am reporting both, but my concern
is about which model I report is the best one to use going forward, since
the BM model is seemingly less accurate (though I am just taking the fitted
values from the PGLS model, which I don't think include phylogenetic
information). The two models I use produce dramatically different results,
for example the BM model produces body mass estimates which are 25% larger
than OLS.

Right now PGLS is something I would avoid if I had the option (if for no
other reason than not put all of the analyses in a single, overloaded
manuscript [the manuscript is already about 90 pages] and deviate from the
scope of the study), but I'm sure you know that most regression analyses
nowadays require some sort of preliminary PCM to be acceptable.

Sincerely,
Russell

On Wed, Jun 30, 2021 at 10:24 AM Julien Clavel 
wrote:


I think that Cécile' and Theodore' point is important and too often
overlooked. Using GLS models, the BLUP (Best Linear Unbiased Prediction) is
not simply obtained from the fitted line but should incorporates
information from the (evolutionary here) model.

For multivariate linear model you can also do it by specifying a tree
including both the species used to build the model and the ones you want to
predict using the “predict” function in mvMORPH (I think that Rphylopars
can deal with multivariate phylogenetic regression too).

Regarding the model comparison, I would simply avoid it (or limit it) by
fitting models flexible enough to accommodate between your BM and OLS case
and summarize the results obtained across all the trees…

Julien


De : R-sig-phylo  de la part de
Theodore Garland 
Envoyé : mercredi 30 juin 2021 03:26
À : Cecile Ane 
Cc : mailman, r-sig-phylo ;
neovenatori...@gmail.com 
Objet : Re: [R-sig-phylo] Model Selection and PGLS

All true.  I would just add two things.  First, always graph your data and
do ordinary OLS analyses as a reality check.

Second, I think this is the original paper for phylogenetic prediction:
Garland, Jr., T., and A. R. Ives. 2000. Using the past to predict the
present: confidence intervals for regression equations in phylogenetic
comparative methods. American Naturalist 155:346–364.
There, we talk about the Equivalency of the Independent-Contrasts and
Generalized Least Squares Approaches.

Cheers,
Ted


On Tue, Jun 29, 2021 at 5:01 PM Cecile Ane  wrote:


Hi Russel,

What you see is the large uncertainty in “ancestral” states, which is

part

of the intercept here. The linear relationship that you overlaid on top

of

your data is the relationship predicted at the root of the tree (as if

such

a thing existed!). There is a lot of uncertainty about the intercept,

but

much less uncertainty in the slope. It looks like the slope is not

affected

by the inclusion or exclusion of monotremes. (for one possible

reference on

the greater precision in the slope versus the intercept, there’s this:
http://dx.doi.org/10.1214/13-AOS1105 for the BM).

My second cent is that the phylogenetic predictions should be stable.

The

uncertainty in the intercept —and the large effect of including

monotremes

on the intercept— should not affect predictions, so long as you know for
which species you want to make a prediction. If you want to make

prediction

for a species in a small clade “far” from monotremes, say, then the
prediction is probably quite stable, even if you include monotremes:

this

is because the phylogenetic prediction should use the phylogenetic
relationships for the species to be predicted. A prediction that uses

the

linear relationship at the root and ignores the placement of the species
would be the worst-case scenario: for a mammal species with a completely
unknown placement within mammals.

There’s probably a number of software that do phylogenetic prediction. I
know of Rphylopars and PhyloNetworks.

my 2 cents…
Cecile

---
Cécile Ané, Professor (she/her)
H. I. Romnes Faculty Fellow
Departments of Statistics and of Botany
University of Wisconsin - Madison
www.stat.wisc.edu/~ane/<http://www.stat.wisc.edu/~ane/>

CALS statistical consulting lab:
https://calslab.cals.wisc.edu/stat-consulting/



On Jun 29, 2021, at 5:37 PM, neovenatori...@gmail.com wrote:

Dear All,

So this is the main problem I'm facing (see attached figure, which

should

be small enough to post). When I calculate the best-fit line under a
Brownian model, thi

Re: [R-sig-phylo] Model Selection and PGLS

2021-07-04 Thread Julien Clavel


"There’s a way to impute phylogenetic signal back into a PGLS model? I am super 
surprised at that. I’ve talked to at least three different colleagues who use 
PGLS about this issue, and all of them had told me that there is no way to 
input phylogenetic signal back into the model for new data points and I should 
just go with the single regression line the model gives me (i.e., the 
regression line for the ancestral node). "

=> Not only it's possible, but this also needed to obtain the best predictions. 
The use of GLS in prediction date back to the early 60's at least (e.g. 
Goldberger 1962). You can find that in any books about GLS too. Theodore 
pointed to Garland & Ives 2000 for their use in a phylogenetic context, you 
have also Martins & Hansen 1997 and probably others. This is tightly related to 
methods used for ancestral state reconstructions (e.g. Goolsby et al. 2017) and 
is readily implemented in at least PhyloNetworks, Rphylopars and mvMORPH.

"I am not entirely sure what is meant here. Do you mean fitting both an OLS and 
BM model and comparing both models?"

=> I mean that using "OU" or "Pagel's lambda" as suggested in previous comments 
(or even more flexible or realistic models!), will be probably better than 
using a dichotomous decision between "no signal" or "BM" because they can 
accomodate both scenario but also some intermediate ones. This is also 
important for hypothesis testing, see for instance Revell 2010 and Clavel & 
Morlon 2020

"though I am just taking the fitted values from the PGLS model, which I don't 
think include phylogenetic information"

=> Again, this is less optimal (probably worse than OLS!) unless the 
phylogenetic model is used in the prediction (see the references above).

Cheers,

Julien

De : Russell Engelman 
Envoyé : jeudi 1 juillet 2021 06:20
À : Julien Clavel 
Cc : Theodore Garland ; Cecile Ane 
; mailman, r-sig-phylo 
Objet : Re: [R-sig-phylo] Model Selection and PGLS 
 
Dear All,

What you see is the large uncertainty in “ancestral” states, which is part of 
the intercept here. The linear relationship that you overlaid on top of your 
data is the relationship predicted at the root of the tree (as if such a thing 
existed!). There is a lot of uncertainty about the intercept, but much less 
uncertainty in the slope. It looks like the slope is not affected by the 
inclusion or exclusion of monotremes. (for one possible reference on the 
greater precision in the slope versus the intercept, there’s this: 
http://dx.doi.org/10.1214/13-AOS1105 for the BM).

Yes, that sounds right from the other data I have. The line approximates what 
would be expected for the root of Mammalia, and the signal in the PGLS is more 
due to shifts in the y-intercept than shifts in slope, which in turn is 
supported by the anatomy of the proxy.

My second cent is that the phylogenetic predictions should be stable. The 
uncertainty in the intercept —and the large effect of including monotremes on 
the intercept— should not affect predictions, so long as you know for which 
species you want to make a prediction. If you want to make prediction for a 
species in a small clade “far” from monotremes, say, then the prediction is 
probably quite stable, even if you include monotremes: this is because the 
phylogenetic prediction should use the phylogenetic relationships for the 
species to be predicted. A prediction that uses the linear relationship at the 
root and ignores the placement of the species would be the worst-case scenario: 
for a mammal species with a completely unknown placement within mammals.

This is what I'm a bit confused about. I was always told (and it seemingly 
implies this in some of the PGLS literature I read like Rohlf 2011 and Smaers 
and Rohlf 2016) that it isn't possible to include phylogenetic data from the 
new data points into the prediction in order to improve predictions. I'm a 
little confused as to whether it's possible or not (see below).

There’s probably a number of software that do phylogenetic prediction. I know 
of Rphylopars and PhyloNetworks.
 
I will take a look into those.

I think that Cécile' and Theodore' point is important and too often overlooked. 
Using GLS models, the BLUP (Best Linear Unbiased Prediction) is not simply 
obtained from the fitted line but should incorporates information from the 
(evolutionary here) model.

     There’s a way to impute phylogenetic signal back into a PGLS model? I am 
super surprised at that. I’ve talked to at least three different colleagues who 
use PGLS about this issue, and all of them had told me that there is no way to 
input phylogenetic signal back into the model for new data points and I should 
just go with the single regression line the model gives me (i.e., the 
regression line for the ancestral node). 

     I tried looking around to see what previous researchers used when using 
PCM on b

Re: [R-sig-phylo] Model Selection and PGLS

2021-07-04 Thread Simone Blomberg
t that. I’ve talked to at least three different
colleagues who use PGLS about this issue, and all of them had told me that
there is no way to input phylogenetic signal back into the model for new
data points and I should just go with the single regression line the model
gives me (i.e., the regression line for the ancestral node).

  I tried looking around to see what previous researchers used when
using PCM on body mass (Esteban-Trivigno and Köhler 2011, Campione and
Evans 2012, Yapuncich 2017 thesis) and it looks like all of them just went
with the best fit line with the ancestral node, i.e., looking at their
reported results they give a simple trait~predictor equation that does not
include phylogeny when calculating new data. Campion and Evans 2012 used
PIC versus PGLS, which I know are technically equivalent but it doesn't
seem like they included phylogenetic information when they predicted new
data: they used their equations on dinosaurs but there are no dinosaurs in
the tree they used. I know that it’s possible to incorporate phylogenetic
signal into the new data using PVR but PVR has been criticized for other
reasons.

 This is something that seems really, really concerning because if
there is a method of using phylogenetic covariance to adjust the position
of new data points it seems like a lot of workers don’t know these methods
exist, to the point that even published papers overlook it. This was
something I was hoping to highlight in a later paper on the data, but it
sounds like people might have discussed it already. I remember talking with
my colleagues a lot about "isn't there some way to incorporate phylogenetic
information back into the model to improve accuracy of the prediction if we
know where the taxon is positioned?" and they just thought there wasn't a
way.

Regarding the model comparison, I would simply avoid it (or limit it) by

fitting models flexible enough to accommodate between your BM and OLS case
and summarize the results obtained across all the trees…


I am not entirely sure what is meant here. Do you mean fitting both an OLS
and BM model and comparing both models? I am reporting both, but my concern
is about which model I report is the best one to use going forward, since
the BM model is seemingly less accurate (though I am just taking the fitted
values from the PGLS model, which I don't think include phylogenetic
information). The two models I use produce dramatically different results,
for example the BM model produces body mass estimates which are 25% larger
than OLS.

Right now PGLS is something I would avoid if I had the option (if for no
other reason than not put all of the analyses in a single, overloaded
manuscript [the manuscript is already about 90 pages] and deviate from the
scope of the study), but I'm sure you know that most regression analyses
nowadays require some sort of preliminary PCM to be acceptable.

Sincerely,
Russell

On Wed, Jun 30, 2021 at 10:24 AM Julien Clavel 
wrote:


I think that Cécile' and Theodore' point is important and too often
overlooked. Using GLS models, the BLUP (Best Linear Unbiased Prediction) is
not simply obtained from the fitted line but should incorporates
information from the (evolutionary here) model.

For multivariate linear model you can also do it by specifying a tree
including both the species used to build the model and the ones you want to
predict using the “predict” function in mvMORPH (I think that Rphylopars
can deal with multivariate phylogenetic regression too).

Regarding the model comparison, I would simply avoid it (or limit it) by
fitting models flexible enough to accommodate between your BM and OLS case
and summarize the results obtained across all the trees…

Julien


De : R-sig-phylo  de la part de
Theodore Garland 
Envoyé : mercredi 30 juin 2021 03:26
À : Cecile Ane 
Cc : mailman, r-sig-phylo ;
neovenatori...@gmail.com 
Objet : Re: [R-sig-phylo] Model Selection and PGLS

All true.  I would just add two things.  First, always graph your data and
do ordinary OLS analyses as a reality check.

Second, I think this is the original paper for phylogenetic prediction:
Garland, Jr., T., and A. R. Ives. 2000. Using the past to predict the
present: confidence intervals for regression equations in phylogenetic
comparative methods. American Naturalist 155:346–364.
There, we talk about the Equivalency of the Independent-Contrasts and
Generalized Least Squares Approaches.

Cheers,
Ted


On Tue, Jun 29, 2021 at 5:01 PM Cecile Ane  wrote:


Hi Russel,

What you see is the large uncertainty in “ancestral” states, which is

part

of the intercept here. The linear relationship that you overlaid on top

of

your data is the relationship predicted at the root of the tree (as if

such

a thing existed!). There is a lot of uncertainty about the intercept,

but

much less uncertainty in the slope. It looks like the slope is not

affected

by the inclusion or exclusion of monotremes. (for one possible

reference on

the great

Re: [R-sig-phylo] Model Selection and PGLS

2021-07-04 Thread Theodore Garland
be predicted. A prediction that uses
> the
> >>> linear relationship at the root and ignores the placement of the
> species
> >>> would be the worst-case scenario: for a mammal species with a
> completely
> >>> unknown placement within mammals.
> >>
> >> This is what I'm a bit confused about. I was always told (and it
> seemingly
> >> implies this in some of the PGLS literature I read like Rohlf 2011 and
> >> Smaers and Rohlf 2016) that it isn't possible to include phylogenetic
> data
> >> from the new data points into the prediction in order to improve
> >> predictions. I'm a little confused as to whether it's possible or not
> (see
> >> below).
> >>
> >> There’s probably a number of software that do phylogenetic prediction. I
> >>> know of Rphylopars and PhyloNetworks.
> >>
> >> I will take a look into those.
> >>
> >> I think that Cécile' and Theodore' point is important and too often
> >>> overlooked. Using GLS models, the BLUP (Best Linear Unbiased
> Prediction) is
> >>> not simply obtained from the fitted line but should incorporates
> >>> information from the (evolutionary here) model.
> >>
> >>   There’s a way to impute phylogenetic signal back into a PGLS
> model? I
> >> am super surprised at that. I’ve talked to at least three different
> >> colleagues who use PGLS about this issue, and all of them had told me
> that
> >> there is no way to input phylogenetic signal back into the model for new
> >> data points and I should just go with the single regression line the
> model
> >> gives me (i.e., the regression line for the ancestral node).
> >>
> >>   I tried looking around to see what previous researchers used when
> >> using PCM on body mass (Esteban-Trivigno and Köhler 2011, Campione and
> >> Evans 2012, Yapuncich 2017 thesis) and it looks like all of them just
> went
> >> with the best fit line with the ancestral node, i.e., looking at their
> >> reported results they give a simple trait~predictor equation that does
> not
> >> include phylogeny when calculating new data. Campion and Evans 2012 used
> >> PIC versus PGLS, which I know are technically equivalent but it doesn't
> >> seem like they included phylogenetic information when they predicted new
> >> data: they used their equations on dinosaurs but there are no dinosaurs
> in
> >> the tree they used. I know that it’s possible to incorporate
> phylogenetic
> >> signal into the new data using PVR but PVR has been criticized for other
> >> reasons.
> >>
> >>  This is something that seems really, really concerning because if
> >> there is a method of using phylogenetic covariance to adjust the
> position
> >> of new data points it seems like a lot of workers don’t know these
> methods
> >> exist, to the point that even published papers overlook it. This was
> >> something I was hoping to highlight in a later paper on the data, but it
> >> sounds like people might have discussed it already. I remember talking
> with
> >> my colleagues a lot about "isn't there some way to incorporate
> phylogenetic
> >> information back into the model to improve accuracy of the prediction
> if we
> >> know where the taxon is positioned?" and they just thought there wasn't
> a
> >> way.
> >>
> >> Regarding the model comparison, I would simply avoid it (or limit it) by
> >>> fitting models flexible enough to accommodate between your BM and OLS
> case
> >>> and summarize the results obtained across all the trees…
> >>
> >> I am not entirely sure what is meant here. Do you mean fitting both an
> OLS
> >> and BM model and comparing both models? I am reporting both, but my
> concern
> >> is about which model I report is the best one to use going forward,
> since
> >> the BM model is seemingly less accurate (though I am just taking the
> fitted
> >> values from the PGLS model, which I don't think include phylogenetic
> >> information). The two models I use produce dramatically different
> results,
> >> for example the BM model produces body mass estimates which are 25%
> larger
> >> than OLS.
> >>
> >> Right now PGLS is something I would avoid if I had the option (if for no
> >> other reason than not put all of the analyses in a single, overloaded
> >> manuscript [the manuscript is already about 90 pages] and deviate from
> the
> >

Re: [R-sig-phylo] Model Selection and PGLS

2021-07-03 Thread Russell Engelman
>
> C is the full variance-covariance matrix, made from the tree that includes
> both the known and unknown species. CCompletedata is the
> variance-covariance matrix WITHOUT the unknown species. This corresponds to
> C in Ted and Tony's paper.  cihmat is the matrix of relationships of the
> unknown species to the known species, one row for each unknown species. So
> you cannot replace cihmat with C.


So how would one go about double-checking the original data to see how well
the fitted values for the PGLS model predict the known original values if
cihmat is not replaceable with C? What would be really nice is if there was
some way to return the values for all the species, both new species and old
ones, and then create a separate column using the call
"ifelse(is.na(dat$y)==TRUE,TRUE,FALSE)"
that could be used to call either the old data or the new ones.

I've been trying the code but there still seems to be some unusual results
when I apply it. I made a dummy version of my dataset with a random set of
10 taxa being recorded as NA for the dependent variable, but adding the
corrections for signal still doesn't seem to be working. Adding the
corrections for signal produces identical error rates, and in general OLS
seems to produce more accurate estimates, which is weird given how large
lambda is. See code below along with attached data (though I reran this
with multiple seeds to see if I got the same results).

library(ape);library(nlme);library(caper);library(tidyverse)
set.seed(20210703)

dat<-read.csv("data.csv",header=T)
row.names(dat)<-dat$species
tree<-read.tree("tree.tre")

fit.ols<-lm(log(bm)~I(log(ocw)^(2/3)),data=dat)

dat$species2=dat$species
dat.comp<-comparative.data(tree,dat,names.col=species2)
dat.comp$data$bm[sample(length(dat.comp$data$bm), 10)] <- NA

dat.comp$data[is.na(dat.comp$data$bm)==TRUE,]

tree.trimmed <- drop.tip(tree, which(is.na(dat.comp$data$bm)))
dat.comp$data.trimmed <- dat.comp$data[complete.cases(dat.comp$data$bm),]
gls(log(bm)~I(log(ocw)^(2/3)),
   correlation=corPagel(0.5, phy=tree.trimmed, form=~species),
   
data=dat%>%filter(species!=dat.comp$data[is.na(dat.comp$data$bm)==TRUE,]$species))
#For some reason tree.trimmed does not work even though the two groups
appear to have the same tips.
gls(log(bm)~I(log(ocw)^(2/3)),
   correlation=corPagel(0.5, phy=tree, form=~species),
   data=dat)
fit <- gls(log(bm)~I(log(ocw)^(2/3)), correlation=corPagel(0.5, phy=tree,
form=~species),
   data=dat.comp$data.trimmed)
(lambda <- as.numeric(fit$modelStruct))

model.matFULL <- model.matrix(~I(log(ocw)^(2/3)), data=dat.comp$data) ##
design matrix for model

missvals <- which(is.na(dat.comp$data$bm))
nmiss <- length(missvals)
preds <- predict(fit, newdata=dat.comp$data[missvals,]) ## get predictions
from model, assuming species are at the root.
mat <- vcv.phylo(tree)
C <- (lambda*mat)+(1-lambda)*diag(diag(mat)) ## Lambda correction

CCompleteData <- C[-missvals, -missvals] ## remove the missing species
cihmat <- C[missvals, -missvals] ## get the relationships of the "new"
species to the previous ones.

XnoMean <- scale(model.matFULL[-missvals,], scale=FALSE)

(mu <- cihmat %*% solve(CCompleteData) %*%  XnoMean)

preds+mu[,2]
(data.check<-data.frame(actual.name=dat.comp$data$species[missvals],
  actual=c(dat$bm[dat$species==row.names(mu)[1]],
   dat$bm[dat$species==row.names(mu)[2]],
   dat$bm[dat$species==row.names(mu)[3]],
   dat$bm[dat$species==row.names(mu)[4]],
   dat$bm[dat$species==row.names(mu)[5]],
   dat$bm[dat$species==row.names(mu)[6]],
   dat$bm[dat$species==row.names(mu)[7]],
   dat$bm[dat$species==row.names(mu)[8]],
   dat$bm[dat$species==row.names(mu)[9]],
   dat$bm[dat$species==row.names(mu)[10]]),
  predicted.ols=c(exp(fit.ols$fitted.values[row.names(mu)[1]]),
  exp(fit.ols$fitted.values[row.names(mu)[2]]),
  exp(fit.ols$fitted.values[row.names(mu)[3]]),
  exp(fit.ols$fitted.values[row.names(mu)[4]]),
  exp(fit.ols$fitted.values[row.names(mu)[5]]),
  exp(fit.ols$fitted.values[row.names(mu)[6]]),
  exp(fit.ols$fitted.values[row.names(mu)[7]]),
  exp(fit.ols$fitted.values[row.names(mu)[8]]),
  exp(fit.ols$fitted.values[row.names(mu)[9]]),
  exp(fit.ols$fitted.values[row.names(mu)[10]])),
  predicted.names=dat.comp$data[missvals,]$species,
  predicted.orig=exp(preds),mu.names=row.names(mu),
  predicted.corr=exp(preds+mu[,2]))%>%
  mutate(across(where(is.numeric), round, 2)))
mean(abs(data.check$predicted.ols-data.check$actual)/data.check$predicted.ols)
#%PE with corrected OLS
mean(abs(data.check$predicted.orig-data.check$actual)/data.check$predicted.orig)
#%PE with uncorrected PGLS

Re: [R-sig-phylo] Model Selection and PGLS

2021-07-02 Thread Simone Blomberg
I knew there would be a mistake somewhere. Here is the correction:

## C <- 1+(lambda-1)*(1-mat) ## Lambda correction WRONG!!
C <- (lambda*mat)+(1-lambda)*diag(diag(mat)) ## correct, I think..

I tested it on a toy example of a tree with 5 species and it works fine..

I hope there are no other mistakes! See other comments below.

Cheers,

Simone.

On 3/7/21 8:40 am, Russell Engelman wrote:
> Dear All,
>
> Okay, so I tried the code that Dr. Blomberg provided, and it doesn't 
> seem to work. I rewrote the code slightly in order for it to try and 
> predict the body mass of the taxa used to calculate the line, rather 
> than some new taxon, because I need to do that in order to calculate 
> prediction error and be able to compare it to previous studies. I got 
> a lambda of about 0.9, which is approximately the same as what I got 
> previously. So on the surface it seems like the method should work.
>
> The problem comes when I try to apply this to the data itself to 
> calculate mu. This is the modified code I used if that helps, though I 
> don't think I can send the data or tree to make it replicable because 
> the message wouldn't go through last time.
>
> tree.trimmed <- drop.tip(trees[[1]], which(is.na (b$bm)))
> b.trimmed <- b[complete.cases(b$bm),]
> fit <- gls(log(bm)~I(log(ocw)^(2/3)), corPagel(0.5, phy=tree.trimmed, 
> form=~species),
>            data=b.trimmed)
> (lambda <- as.numeric(fit$modelStruct))
> model.matFULL <- model.matrix(~I(log(ocw)^(2/3)), data=b)
> preds <- predict(fit,newdata=b)
> mat <- vcv.phylo(trees[[1]])
> C <- 1+(lambda-1)*(1-mat) ## Lambda correction
> XnoMean <- scale(model.matFULL, scale=FALSE)
> mu <- C %*% solve(C) %*%  XnoMean ## Replaced CCompletedata and cihmat 
> with C, since both seem to be subsets of that data. This is the only 
> place I could think things might go wrong.

C is the full variance-covariance matrix, made from the tree that 
includes both the known and unknown species. CCompletedata is the 
variance-covariance matrix WITHOUT the unknown species. This corresponds 
to C in Ted and Tony's paper.  cihmat is the matrix of relationships of 
the unknown species to the known species, one row for each unknown 
species. So you cannot replace cihmat with C.

> mu2<-(mu[match(b$species,row.names(mu)),])[,2] ## Have to re-sort mu 
> based on the data sheet. Mu is reported in phylogenetic order, whereas 
> predict.gls will always report the data in alphabetical order of the 
> row names even if a phylogeny is present.
predict.gls reports the predicted values in the order that is present in 
the newdata argument. You shouldn't have to re-order anything in my 
example code. YMMV however if you adapt my code.
> exp(preds+mu2) ## "correct" prediction accounting for bias due to 
> correlations with other species. Exp() is because data were 
> log-transformed.
>
> So when I check the results of this equation the predicted values are 
> actually /less /accurate than the fitted values reported from gls 
> without correction. Normally the accuracy is about +/- 64% of the 
> actual value, but adding the correction here results in the data 
> being +/- 84% of the actual value. What's more, I took a look at the 
> values of mu and they don't seem to make sense when compared on the 
> phylogeny. Namely, if mu were being calculated right, one would expect 
> monotremes to have a very negative value compared to all other mammals 
> given the deep divergence in trait states between monotremes and all 
> other taxa in the figure I previously attached. But when I check the 
> mu values, I get a value of 0.35 for /Zaglossus/, -0.58 for 
> /Tachyglossus/, and 0.59 for /Ornithorhynchus/. Which is not what 
> would be expected if mu were modelling a phylogenetic correction 
> factor (if nothing else, the range of mu should probably not span 
> zero). So it suggests to me I'm not doing something right. I know that 
> PGLS doesn't always result in better fits than OLS, but I'm surprised 
> adding signal back in made the results worse when the fit was so poor 
> when it was fitted through the ancestral node.
>
> Sincerely,
> Russell

-- 
Simone Blomberg, BSc (Hons), PhD, MAppStat (she/her)
Senior Lecturer and Consultant Statistician
School of Biological Sciences
The University of Queensland
St. Lucia Queensland 4072
Australia
T: +61 7 3365 2506
email: S.Blomberg1_at_uq.edu.au
Twitter: @simoneb66
UQ ALLY Supporting the diversity of sexuality and gender at UQ.

Policies:
1.  I will NOT analyse your data for you.
2.  Your deadline is your problem.

Basically, I'm not interested in doing research
and I never have been. I'm interested in
understanding, which is quite a different thing.
And often to understand something you have to
work it out for yourself because no one else
has done it. - David Blackwell


[[alternative HTML version deleted]]

___
R-sig-phylo mailing list - R-sig-phylo@r-project.org

Re: [R-sig-phylo] Model Selection and PGLS

2021-07-02 Thread Russell Engelman
Dear All,

Okay, so I tried the code that Dr. Blomberg provided, and it doesn't seem
to work. I rewrote the code slightly in order for it to try and predict the
body mass of the taxa used to calculate the line, rather than some new
taxon, because I need to do that in order to calculate prediction error and
be able to compare it to previous studies. I got a lambda of about 0.9,
which is approximately the same as what I got previously. So on the surface
it seems like the method should work.

The problem comes when I try to apply this to the data itself to calculate
mu. This is the modified code I used if that helps, though I don't think I
can send the data or tree to make it replicable because the message
wouldn't go through last time.

tree.trimmed <- drop.tip(trees[[1]], which(is.na(b$bm)))
b.trimmed <- b[complete.cases(b$bm),]
fit <- gls(log(bm)~I(log(ocw)^(2/3)), corPagel(0.5, phy=tree.trimmed,
form=~species),
   data=b.trimmed)
(lambda <- as.numeric(fit$modelStruct))
model.matFULL <- model.matrix(~I(log(ocw)^(2/3)), data=b)
preds <- predict(fit,newdata=b)
mat <- vcv.phylo(trees[[1]])
C <- 1+(lambda-1)*(1-mat) ## Lambda correction
XnoMean <- scale(model.matFULL, scale=FALSE)
mu <- C %*% solve(C) %*%  XnoMean ## Replaced CCompletedata and cihmat with
C, since both seem to be subsets of that data. This is the only place I
could think things might go wrong.
mu2<-(mu[match(b$species,row.names(mu)),])[,2] ## Have to re-sort mu based
on the data sheet. Mu is reported in phylogenetic order, whereas
predict.gls will always report the data in alphabetical order of the row
names even if a phylogeny is present.
exp(preds+mu2) ## "correct" prediction accounting for bias due to
correlations with other species. Exp() is because data were log-transformed.

So when I check the results of this equation the predicted values are
actually *less *accurate than the fitted values reported from gls without
correction. Normally the accuracy is about +/- 64% of the actual value, but
adding the correction here results in the data being +/- 84% of the actual
value. What's more, I took a look at the values of mu and they don't seem
to make sense when compared on the phylogeny. Namely, if mu were being
calculated right, one would expect monotremes to have a very negative value
compared to all other mammals given the deep divergence in trait states
between monotremes and all other taxa in the figure I previously attached.
But when I check the mu values, I get a value of 0.35 for *Zaglossus*,
-0.58 for *Tachyglossus*, and 0.59 for *Ornithorhynchus*. Which is not what
would be expected if mu were modelling a phylogenetic correction factor (if
nothing else, the range of mu should probably not span zero). So it
suggests to me I'm not doing something right. I know that PGLS doesn't
always result in better fits than OLS, but I'm surprised adding signal back
in made the results worse when the fit was so poor when it was fitted
through the ancestral node.

Sincerely,
Russell

[[alternative HTML version deleted]]

___
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/


Re: [R-sig-phylo] Model Selection and PGLS

2021-07-02 Thread Chris Organ
on’t know these methods
> > exist, to the point that even published papers overlook it. This was
> > something I was hoping to highlight in a later paper on the data, but it
> > sounds like people might have discussed it already. I remember talking with
> > my colleagues a lot about "isn't there some way to incorporate phylogenetic
> > information back into the model to improve accuracy of the prediction if we
> > know where the taxon is positioned?" and they just thought there wasn't a
> > way.
> >
> > Regarding the model comparison, I would simply avoid it (or limit it) by
> >> fitting models flexible enough to accommodate between your BM and OLS case
> >> and summarize the results obtained across all the trees…
> >
> >
> > I am not entirely sure what is meant here. Do you mean fitting both an OLS
> > and BM model and comparing both models? I am reporting both, but my concern
> > is about which model I report is the best one to use going forward, since
> > the BM model is seemingly less accurate (though I am just taking the fitted
> > values from the PGLS model, which I don't think include phylogenetic
> > information). The two models I use produce dramatically different results,
> > for example the BM model produces body mass estimates which are 25% larger
> > than OLS.
> >
> > Right now PGLS is something I would avoid if I had the option (if for no
> > other reason than not put all of the analyses in a single, overloaded
> > manuscript [the manuscript is already about 90 pages] and deviate from the
> > scope of the study), but I'm sure you know that most regression analyses
> > nowadays require some sort of preliminary PCM to be acceptable.
> >
> > Sincerely,
> > Russell
> >
> > On Wed, Jun 30, 2021 at 10:24 AM Julien Clavel 
> > wrote:
> >
> >> I think that Cécile' and Theodore' point is important and too often
> >> overlooked. Using GLS models, the BLUP (Best Linear Unbiased Prediction) is
> >> not simply obtained from the fitted line but should incorporates
> >> information from the (evolutionary here) model.
> >>
> >> For multivariate linear model you can also do it by specifying a tree
> >> including both the species used to build the model and the ones you want to
> >> predict using the “predict” function in mvMORPH (I think that Rphylopars
> >> can deal with multivariate phylogenetic regression too).
> >>
> >> Regarding the model comparison, I would simply avoid it (or limit it) by
> >> fitting models flexible enough to accommodate between your BM and OLS case
> >> and summarize the results obtained across all the trees…
> >>
> >> Julien
> >>
> >>
> >> De : R-sig-phylo  de la part de
> >> Theodore Garland 
> >> Envoyé : mercredi 30 juin 2021 03:26
> >> À : Cecile Ane 
> >> Cc : mailman, r-sig-phylo ;
> >> neovenatori...@gmail.com 
> >> Objet : Re: [R-sig-phylo] Model Selection and PGLS
> >>
> >> All true.  I would just add two things.  First, always graph your data and
> >> do ordinary OLS analyses as a reality check.
> >>
> >> Second, I think this is the original paper for phylogenetic prediction:
> >> Garland, Jr., T., and A. R. Ives. 2000. Using the past to predict the
> >> present: confidence intervals for regression equations in phylogenetic
> >> comparative methods. American Naturalist 155:346–364.
> >> There, we talk about the Equivalency of the Independent-Contrasts and
> >> Generalized Least Squares Approaches.
> >>
> >> Cheers,
> >> Ted
> >>
> >>
> >> On Tue, Jun 29, 2021 at 5:01 PM Cecile Ane  wrote:
> >>
> >> > Hi Russel,
> >> >
> >> > What you see is the large uncertainty in “ancestral” states, which is
> >> part
> >> > of the intercept here. The linear relationship that you overlaid on top
> >> of
> >> > your data is the relationship predicted at the root of the tree (as if
> >> such
> >> > a thing existed!). There is a lot of uncertainty about the intercept,
> >> but
> >> > much less uncertainty in the slope. It looks like the slope is not
> >> affected
> >> > by the inclusion or exclusion of monotremes. (for one possible
> >> reference on
> >> > the greater precision in the slope versus the intercept, there’s this:
> >> > http://dx.doi.org/10.1214/13-AOS1105 for the BM).
> >> >
> >> > My second cent 

Re: [R-sig-phylo] Model Selection and PGLS

2021-07-01 Thread Russell Engelman
Dear All,

Okay, I've read all of the papers that people have linked, and I have an
understanding of the theory. However, where I am struggling now is trying
to figure out how to actually perform a PGLS that incorporates signal when
predicting new data in R. I tested a bunch of gls functions in different R
packages, including *pgls* in *caper, phylolm* in *phylolm*, and *gls* in
*nlme*. I tried mvgls in *mvMORPH *but I got the error message saying to
try *gls *in *nlme* because *mvMORPH *requires multivariate data. In all of
these cases it doesn't appear as though the functions incorporate
phylogenetic information back into the PGLS model to make predictions. For
body mass estimation equations model accuracy is usually evaluated by
calculating the percent error between the fitted values for the data and
the actual data values, since prediction is the goal and log-transformation
can make other parameters like r2 more difficult to use (though I know r2
isn't usually a thing when it comes to PGLS).

However, when I calculated percent error, I noticed that the fitted values
of the function seem to be predicted through the ancestral node, *not *by
adjusting the prediction using phylogenetic information. This appears to be
the case regardless of whether fitted values were called directly from the
function (e.g., "fit.gls$fitted.values") or by predicting values of the
original data using *predict *(i.e., "predict(fit.gls,data)"). The results
of the two methods were identical when I used *phylolm *and *gls*, but
differed slightly between using predict or fitted values in *caper *(but
not enough to conclude that one method did incorporate phylogenetic signal,
just enough that it's clear they aren't identical). Furthermore, I also
tried manually estimating point values using the line fit through the
ancestral node and the values produced were identical to the fitted values
reported in these other methods. Whatever I'm doing, it's  not
incorporating phylogenetic signal to produce estimates.

I haven't tried the Bayesian methodology in Organ et al. (2007) yet, mostly
because I'm not familiar with BayesTraits and am trying to keep all my
analyses in R if possible as I have an RMarkdown document that I am
documenting my method as supplementary information. I may have to try
BayesTraits if there aren't any other options. I'm also not sure if PIC
would be easier than trying to mess with Bayesian models, though I know PIC
and PGLS produce more or less identical results.

I am surprised that researchers don’t know about these methods: they are
> basically just like ancestral state reconstruction, but for new tips in the
> tree. Everyone knows about “ancestral state reconstruction”. But if few
> people recognize that the same methodology can be applied to new taxa, then
> it’s a problem indeed.


I had numerous debates with colleagues and other researchers in my
department about this exact topic, and it's nice to know the arguments I
made of "if you already have a variance- covariance matrix with known taxa,
why can't you just adjust estimated based on phylogenetic data if you have
a resolved phylogeny" weren't completely nuts. I think a lot of it is there
don't seem to be a lot of studies where point estimates are the intended
goal (more looking at broader regression models), and at least in body mass
the only study I've seen that does include phylogenetic signal back into
the data uses PVR.

Sincerely,
Russell

On Thu, Jul 1, 2021 at 10:12 AM Cecile Ane  wrote:

>
>
> On Jun 30, 2021, at 11:20 PM, neovenatori...@gmail.com wrote:
>
> This is something that seems really, really concerning because if there is
> a method of using phylogenetic covariance to adjust the position of new
> data points it seems like a lot of workers don’t know these methods exist,
> to the point that even published papers overlook it.
>
>
> I am surprised that researchers don’t know about these methods: they are
> basically just like ancestral state reconstruction, but for new tips in the
> tree. Everyone knows about “ancestral state reconstruction”. But if few
> people recognize that the same methodology can be applied to new taxa, then
> it’s a problem indeed.
>
> Ancestral state “reconstruction” could be better called ancestral state
> “estimation” or “prediction”. In a likelihood framework, we can also give
> prediction intervals, for internal (ancestral) nodes in the tree as well
> as new tips. Not all software handle new taxa, because it requires
> extending the tree, or accepting a tree in which not all taxa have data.
> Taxa with missing data can then be predicted, like any other node.
>
> Cecile.
>
>

[[alternative HTML version deleted]]

___
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/


Re: [R-sig-phylo] Model Selection and PGLS

2021-07-01 Thread Cecile Ane


On Jun 30, 2021, at 11:20 PM, 
neovenatori...@gmail.com wrote:

This is something that seems really, really concerning because if there is a 
method of using phylogenetic covariance to adjust the position of new data 
points it seems like a lot of workers don’t know these methods exist, to the 
point that even published papers overlook it.

I am surprised that researchers don’t know about these methods: they are 
basically just like ancestral state reconstruction, but for new tips in the 
tree. Everyone knows about “ancestral state reconstruction”. But if few people 
recognize that the same methodology can be applied to new taxa, then it’s a 
problem indeed.

Ancestral state “reconstruction” could be better called ancestral state 
“estimation” or “prediction”. In a likelihood framework, we can also give 
prediction intervals, for internal (ancestral) nodes in the tree as well as new 
tips. Not all software handle new taxa, because it requires extending the tree, 
or accepting a tree in which not all taxa have data. Taxa with missing data can 
then be predicted, like any other node.

Cecile.


[[alternative HTML version deleted]]

___
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/


Re: [R-sig-phylo] Model Selection and PGLS

2021-06-30 Thread Theodore Garland
t it) by
>> fitting models flexible enough to accommodate between your BM and OLS case
>> and summarize the results obtained across all the trees…
>
>
> I am not entirely sure what is meant here. Do you mean fitting both an OLS
> and BM model and comparing both models? I am reporting both, but my concern
> is about which model I report is the best one to use going forward, since
> the BM model is seemingly less accurate (though I am just taking the fitted
> values from the PGLS model, which I don't think include phylogenetic
> information). The two models I use produce dramatically different results,
> for example the BM model produces body mass estimates which are 25% larger
> than OLS.
>
> Right now PGLS is something I would avoid if I had the option (if for no
> other reason than not put all of the analyses in a single, overloaded
> manuscript [the manuscript is already about 90 pages] and deviate from the
> scope of the study), but I'm sure you know that most regression analyses
> nowadays require some sort of preliminary PCM to be acceptable.
>
> Sincerely,
> Russell
>
> On Wed, Jun 30, 2021 at 10:24 AM Julien Clavel 
> wrote:
>
>> I think that Cécile' and Theodore' point is important and too often
>> overlooked. Using GLS models, the BLUP (Best Linear Unbiased Prediction) is
>> not simply obtained from the fitted line but should incorporates
>> information from the (evolutionary here) model.
>>
>> For multivariate linear model you can also do it by specifying a tree
>> including both the species used to build the model and the ones you want to
>> predict using the “predict” function in mvMORPH (I think that Rphylopars
>> can deal with multivariate phylogenetic regression too).
>>
>> Regarding the model comparison, I would simply avoid it (or limit it) by
>> fitting models flexible enough to accommodate between your BM and OLS case
>> and summarize the results obtained across all the trees…
>>
>> Julien
>>
>>
>> De : R-sig-phylo  de la part de
>> Theodore Garland 
>> Envoyé : mercredi 30 juin 2021 03:26
>> À : Cecile Ane 
>> Cc : mailman, r-sig-phylo ;
>> neovenatori...@gmail.com 
>> Objet : Re: [R-sig-phylo] Model Selection and PGLS
>>
>> All true.  I would just add two things.  First, always graph your data and
>> do ordinary OLS analyses as a reality check.
>>
>> Second, I think this is the original paper for phylogenetic prediction:
>> Garland, Jr., T., and A. R. Ives. 2000. Using the past to predict the
>> present: confidence intervals for regression equations in phylogenetic
>> comparative methods. American Naturalist 155:346–364.
>> There, we talk about the Equivalency of the Independent-Contrasts and
>> Generalized Least Squares Approaches.
>>
>> Cheers,
>> Ted
>>
>>
>> On Tue, Jun 29, 2021 at 5:01 PM Cecile Ane  wrote:
>>
>> > Hi Russel,
>> >
>> > What you see is the large uncertainty in “ancestral” states, which is
>> part
>> > of the intercept here. The linear relationship that you overlaid on top
>> of
>> > your data is the relationship predicted at the root of the tree (as if
>> such
>> > a thing existed!). There is a lot of uncertainty about the intercept,
>> but
>> > much less uncertainty in the slope. It looks like the slope is not
>> affected
>> > by the inclusion or exclusion of monotremes. (for one possible
>> reference on
>> > the greater precision in the slope versus the intercept, there’s this:
>> > http://dx.doi.org/10.1214/13-AOS1105 for the BM).
>> >
>> > My second cent is that the phylogenetic predictions should be stable.
>> The
>> > uncertainty in the intercept —and the large effect of including
>> monotremes
>> > on the intercept— should not affect predictions, so long as you know for
>> > which species you want to make a prediction. If you want to make
>> prediction
>> > for a species in a small clade “far” from monotremes, say, then the
>> > prediction is probably quite stable, even if you include monotremes:
>> this
>> > is because the phylogenetic prediction should use the phylogenetic
>> > relationships for the species to be predicted. A prediction that uses
>> the
>> > linear relationship at the root and ignores the placement of the species
>> > would be the worst-case scenario: for a mammal species with a completely
>> > unknown placement within mammals.
>> >
>> > There’s probably a number of software that do phylogenetic prediction. I
>> > know of Rphylopars 

Re: [R-sig-phylo] Model Selection and PGLS

2021-06-30 Thread Russell Engelman
netic
information). The two models I use produce dramatically different results,
for example the BM model produces body mass estimates which are 25% larger
than OLS.

Right now PGLS is something I would avoid if I had the option (if for no
other reason than not put all of the analyses in a single, overloaded
manuscript [the manuscript is already about 90 pages] and deviate from the
scope of the study), but I'm sure you know that most regression analyses
nowadays require some sort of preliminary PCM to be acceptable.

Sincerely,
Russell

On Wed, Jun 30, 2021 at 10:24 AM Julien Clavel 
wrote:

> I think that Cécile' and Theodore' point is important and too often
> overlooked. Using GLS models, the BLUP (Best Linear Unbiased Prediction) is
> not simply obtained from the fitted line but should incorporates
> information from the (evolutionary here) model.
>
> For multivariate linear model you can also do it by specifying a tree
> including both the species used to build the model and the ones you want to
> predict using the “predict” function in mvMORPH (I think that Rphylopars
> can deal with multivariate phylogenetic regression too).
>
> Regarding the model comparison, I would simply avoid it (or limit it) by
> fitting models flexible enough to accommodate between your BM and OLS case
> and summarize the results obtained across all the trees…
>
> Julien
>
>
> De : R-sig-phylo  de la part de
> Theodore Garland 
> Envoyé : mercredi 30 juin 2021 03:26
> À : Cecile Ane 
> Cc : mailman, r-sig-phylo ;
> neovenatori...@gmail.com 
> Objet : Re: [R-sig-phylo] Model Selection and PGLS
>
> All true.  I would just add two things.  First, always graph your data and
> do ordinary OLS analyses as a reality check.
>
> Second, I think this is the original paper for phylogenetic prediction:
> Garland, Jr., T., and A. R. Ives. 2000. Using the past to predict the
> present: confidence intervals for regression equations in phylogenetic
> comparative methods. American Naturalist 155:346–364.
> There, we talk about the Equivalency of the Independent-Contrasts and
> Generalized Least Squares Approaches.
>
> Cheers,
> Ted
>
>
> On Tue, Jun 29, 2021 at 5:01 PM Cecile Ane  wrote:
>
> > Hi Russel,
> >
> > What you see is the large uncertainty in “ancestral” states, which is
> part
> > of the intercept here. The linear relationship that you overlaid on top
> of
> > your data is the relationship predicted at the root of the tree (as if
> such
> > a thing existed!). There is a lot of uncertainty about the intercept, but
> > much less uncertainty in the slope. It looks like the slope is not
> affected
> > by the inclusion or exclusion of monotremes. (for one possible reference
> on
> > the greater precision in the slope versus the intercept, there’s this:
> > http://dx.doi.org/10.1214/13-AOS1105 for the BM).
> >
> > My second cent is that the phylogenetic predictions should be stable. The
> > uncertainty in the intercept —and the large effect of including
> monotremes
> > on the intercept— should not affect predictions, so long as you know for
> > which species you want to make a prediction. If you want to make
> prediction
> > for a species in a small clade “far” from monotremes, say, then the
> > prediction is probably quite stable, even if you include monotremes: this
> > is because the phylogenetic prediction should use the phylogenetic
> > relationships for the species to be predicted. A prediction that uses the
> > linear relationship at the root and ignores the placement of the species
> > would be the worst-case scenario: for a mammal species with a completely
> > unknown placement within mammals.
> >
> > There’s probably a number of software that do phylogenetic prediction. I
> > know of Rphylopars and PhyloNetworks.
> >
> > my 2 cents…
> > Cecile
> >
> > ---
> > Cécile Ané, Professor (she/her)
> > H. I. Romnes Faculty Fellow
> > Departments of Statistics and of Botany
> > University of Wisconsin - Madison
> > www.stat.wisc.edu/~ane/<http://www.stat.wisc.edu/~ane/>
> >
> > CALS statistical consulting lab:
> > https://calslab.cals.wisc.edu/stat-consulting/
> >
> >
> >
> > On Jun 29, 2021, at 5:37 PM, neovenatori...@gmail.com > neovenatori...@gmail.com> wrote:
> >
> > Dear All,
> >
> > So this is the main problem I'm facing (see attached figure, which should
> > be small enough to post). When I calculate the best-fit line under a
> > Brownian model, this produces a best-fit line that more or less bypasses
> > the distribution of the data altogether. I did some testing and found
> tha

Re: [R-sig-phylo] Model Selection and PGLS

2021-06-29 Thread Cecile Ane
Hi Russel,

What you see is the large uncertainty in “ancestral” states, which is part of 
the intercept here. The linear relationship that you overlaid on top of your 
data is the relationship predicted at the root of the tree (as if such a thing 
existed!). There is a lot of uncertainty about the intercept, but much less 
uncertainty in the slope. It looks like the slope is not affected by the 
inclusion or exclusion of monotremes. (for one possible reference on the 
greater precision in the slope versus the intercept, there’s this: 
http://dx.doi.org/10.1214/13-AOS1105 for the BM).

My second cent is that the phylogenetic predictions should be stable. The 
uncertainty in the intercept —and the large effect of including monotremes on 
the intercept— should not affect predictions, so long as you know for which 
species you want to make a prediction. If you want to make prediction for a 
species in a small clade “far” from monotremes, say, then the prediction is 
probably quite stable, even if you include monotremes: this is because the 
phylogenetic prediction should use the phylogenetic relationships for the 
species to be predicted. A prediction that uses the linear relationship at the 
root and ignores the placement of the species would be the worst-case scenario: 
for a mammal species with a completely unknown placement within mammals.

There’s probably a number of software that do phylogenetic prediction. I know 
of Rphylopars and PhyloNetworks.

my 2 cents…
Cecile

---
Cécile Ané, Professor (she/her)
H. I. Romnes Faculty Fellow
Departments of Statistics and of Botany
University of Wisconsin - Madison
www.stat.wisc.edu/~ane/

CALS statistical consulting lab: https://calslab.cals.wisc.edu/stat-consulting/



On Jun 29, 2021, at 5:37 PM, 
neovenatori...@gmail.com wrote:

Dear All,

So this is the main problem I'm facing (see attached figure, which should be 
small enough to post). When I calculate the best-fit line under a Brownian 
model, this produces a best-fit line that more or less bypasses the 
distribution of the data altogether. I did some testing and found that this 
result was driven solely by the presence of Monotremata, resulting in the model 
heavily downweighting all of the phylogenetic variation within Theria in favor 
of the deep divergence between Monotremata and Theria. Excluding Monotremata 
produces a PGLS fit that's comparable enough to the OLS and OU model fit to be 
justifiable (though I can't just throw out Monotremata for the sake of throwing 
it out).

I am planning to do a more theoretical investigation into the effect of 
Monotremata on the PGLS fit in a future study, but right now what I am trying 
to do is perform a study in which I use this data to construct a regression 
model that can be used to predict new data. Which is why I am trying to use AIC 
to potentially justify going with OLS or an OU model over a Brownian model. 
From a practical perspective the Brownian model is almost unusable because it 
produces systematically biased estimates with high error rates when applied to 
new data (error rate is roughly double that of both the OLS and OU model). This 
is especially the case because the data must be back-transformed into an 
arithmetic scale to be useable, and thus a seemingly minor difference in 
regression models results in a massive difference in predicted values. However, 
I need some objective test to show that OLS fits the data better than the 
Brownian model, hence why I was going with AIC. Overall, OLS does seem to 
outperform the Brownian model on average, but the variation in AIC is so high 
it is hard to interpret this.

This is kind of why I am leery of assuming a null Brownian model. A Brownian 
model, if anything, does not seem to accurately model the relationship between 
variables.

This is why I am having trouble figuring out how to do model selection. Just 
going with accuracy statistics like percent error or standard error of the 
estimate OLS is better from a purely practical sense (it doesn't work for the 
monotreme taxa, but it turns out that estimate error in the monotremes is only 
decreased by 10% in a Brownian model when it overestimates mass by nearly 75%, 
so the improvement really isn't worth it and using this for monotremes isn't 
recommended in the first place), but the reviewers are expressing skepticism 
over the fact that the Brownian model produces less useable results. And I'm 
not entirely sure the best way to go about the PGLS if using one of the 
birth-death trees isn't ideal, perhaps what Dr. Upham says about using the DNA 
tree might work better.

Ironically, an OU model might be argued to better fit the data, despite the 
concerns that Dr. Bapst mentioned. Looking at the distribution of signal even 
though signal is not random, it is more accurately described as most taxa 
hewing to a stable equilibrium with rapid, high magnitude shifts at certain 

Re: [R-sig-phylo] Model Selection and PGLS

2021-06-29 Thread Theodore Garland
All true.  I would just add two things.  First, always graph your data and
do ordinary OLS analyses as a reality check.

Second, I think this is the original paper for phylogenetic prediction:
Garland, Jr., T., and A. R. Ives. 2000. Using the past to predict the
present: confidence intervals for regression equations in phylogenetic
comparative methods. American Naturalist 155:346–364.
There, we talk about the Equivalency of the Independent-Contrasts and
Generalized Least Squares Approaches.

Cheers,
Ted


On Tue, Jun 29, 2021 at 5:01 PM Cecile Ane  wrote:

> Hi Russel,
>
> What you see is the large uncertainty in “ancestral” states, which is part
> of the intercept here. The linear relationship that you overlaid on top of
> your data is the relationship predicted at the root of the tree (as if such
> a thing existed!). There is a lot of uncertainty about the intercept, but
> much less uncertainty in the slope. It looks like the slope is not affected
> by the inclusion or exclusion of monotremes. (for one possible reference on
> the greater precision in the slope versus the intercept, there’s this:
> http://dx.doi.org/10.1214/13-AOS1105 for the BM).
>
> My second cent is that the phylogenetic predictions should be stable. The
> uncertainty in the intercept —and the large effect of including monotremes
> on the intercept— should not affect predictions, so long as you know for
> which species you want to make a prediction. If you want to make prediction
> for a species in a small clade “far” from monotremes, say, then the
> prediction is probably quite stable, even if you include monotremes: this
> is because the phylogenetic prediction should use the phylogenetic
> relationships for the species to be predicted. A prediction that uses the
> linear relationship at the root and ignores the placement of the species
> would be the worst-case scenario: for a mammal species with a completely
> unknown placement within mammals.
>
> There’s probably a number of software that do phylogenetic prediction. I
> know of Rphylopars and PhyloNetworks.
>
> my 2 cents…
> Cecile
>
> ---
> Cécile Ané, Professor (she/her)
> H. I. Romnes Faculty Fellow
> Departments of Statistics and of Botany
> University of Wisconsin - Madison
> www.stat.wisc.edu/~ane/
>
> CALS statistical consulting lab:
> https://calslab.cals.wisc.edu/stat-consulting/
>
>
>
> On Jun 29, 2021, at 5:37 PM, neovenatori...@gmail.com neovenatori...@gmail.com> wrote:
>
> Dear All,
>
> So this is the main problem I'm facing (see attached figure, which should
> be small enough to post). When I calculate the best-fit line under a
> Brownian model, this produces a best-fit line that more or less bypasses
> the distribution of the data altogether. I did some testing and found that
> this result was driven solely by the presence of Monotremata, resulting in
> the model heavily downweighting all of the phylogenetic variation within
> Theria in favor of the deep divergence between Monotremata and Theria.
> Excluding Monotremata produces a PGLS fit that's comparable enough to the
> OLS and OU model fit to be justifiable (though I can't just throw out
> Monotremata for the sake of throwing it out).
>
> I am planning to do a more theoretical investigation into the effect of
> Monotremata on the PGLS fit in a future study, but right now what I am
> trying to do is perform a study in which I use this data to construct a
> regression model that can be used to predict new data. Which is why I am
> trying to use AIC to potentially justify going with OLS or an OU model over
> a Brownian model. From a practical perspective the Brownian model is almost
> unusable because it produces systematically biased estimates with high
> error rates when applied to new data (error rate is roughly double that of
> both the OLS and OU model). This is especially the case because the data
> must be back-transformed into an arithmetic scale to be useable, and thus a
> seemingly minor difference in regression models results in a massive
> difference in predicted values. However, I need some objective test to show
> that OLS fits the data better than the Brownian model, hence why I was
> going with AIC. Overall, OLS does seem to outperform the Brownian model on
> average, but the variation in AIC is so high it is hard to interpret this.
>
> This is kind of why I am leery of assuming a null Brownian model. A
> Brownian model, if anything, does not seem to accurately model the
> relationship between variables.
>
> This is why I am having trouble figuring out how to do model selection.
> Just going with accuracy statistics like percent error or standard error of
> the estimate OLS is better from a purely practical sense (it doesn't work
> for the monotreme taxa, but it turns out that estimate error in the
> monotremes is only decreased by 10% in a Brownian model when it
> overestimates mass by nearly 75%, so the improvement really isn't worth it
> and using this for 

Re: [R-sig-phylo] Model Selection and PGLS

2021-06-29 Thread Theodore Garland
The other possible null model would be a "star" phylogeny with no
hierarchical structure, equal-length branches, and also Brownian motion.
But that's generally viewed as outside of the range of reasonable
possibilities.
Cheers,
Ted

On Tue, Jun 29, 2021 at 12:05 PM Nathan Upham  wrote:

> Hi Russell and all, sounds good.
>
> I’d suggest that the “null model” for fitting trait data to a phylogeny
> should be single-rate Brownian motion, i.e., you’re assuming that given
> data on the ancestor-to-descendant relationships of the species (and timing
> of divergences), and assuming the trait is heritable, it is evolving at the
> same random rate along each branch.  The burden of proof is on rejecting
> that null hypothesis (not “accepting it”— sorry for earlier writing that
> incorrectly!).  So if you do your AIC fitting across the 100 trees,
> summarize the results, and find no clear signal of a model being obviously
> better than single-rate Brownian, then that is what you should use for
> subsequent analyses.
>
> If anyone has a different perspective on this, please chime in.  The above
> assumes that you’ve established heritability of the trait, for example by
> doing a test for phylogenetic ’signal’.
>
> Does that help?  All the best
> —nate
>
>
>
> > On Jun 28, 2021, at 1:25 PM, Russell Engelman 
> wrote:
> >
> > Dear Dr. Upham (and All),
> >
> > Please don't take my initial message the wrong way, this is not meant to
> be a dig at your 2019 study. I don’t think this is due to the birth-death
> tree specifically but would be present in any study where there are
> multiple phylogenetic trees to choose from or some measure of uncertainty
> in the tip dates. I definitely agree with you that there is almost
> certainly going to be variation in model support values if there is any
> difference in the underlying phylogeny, however, I was surprised that AIC
> would vary this much in a dataset where the trait data, number of tips, and
> branching topology used to compute the model are more or less constant
> between trees.
> >
> > My question is more along the lines of "given that it is logical to
> expect AIC to vary based on differences between trees, how would one go
> about determining which regression model is the "optimal" one to use for
> further analysis"? You mentioned taking the 95% confidence intervals of the
> models and seeing if they don't overlap, would this be just taking the
> singular AIC from the OLS model and comparing it to the PGLS one, since OLS
> seemingly doesn't produce a confidence interval of AIC values? And if the
> confidence intervals do overlap, is the OLS  or PGLS considered the null
> hypothesis? In my case the AIC for OLS is within the 95% confidence
> intervals for the PGLS, but is much lower than the mean value (it's close
> to the lower first standard deviation of the AIC values).
> >
> > Sincerely,
> > Russell
> >
> > On Mon, Jun 28, 2021 at 2:46 PM Nathan Upham  > wrote:
> > Hi Russell and all:
> >
> > I’ll respond here since the answer is related to the intended purpose of
> the VertLife mammal trees — i.e, capturing full uncertainty in node ages
> and phylogenetic relationships was one of the motivators for building the
> mammal trees in the way we did.  This approach contrasts to wanting to
> obtain the single “best tree”, since methods of phylogenetic reconstruction
> will always just be approximations of the “true tree” anyway rather than
> ever being equal to that tree.  To only use a single consensus tree in
> comparative phylogenetic analyses assumes that we know the true tree, which
> again, we don’t ever in an empirical context (only for simulations).  Those
> points were summarized well by Huelsenbeck et al. (2000:
> http://science.sciencemag.org/content/288/5475/2349 <
> https://urldefense.com/v3/__http://science.sciencemag.org/content/288/5475/2349__;!!IKRxdwAv5BmarQ!NHbBAhu7TSMQWrZivqX9-FnG8bIhVy_zP2y-3oDjm_A6NttbEbrXO3uQoJczwsIVKQ$>),
> but nevertheless are still not standard practice in PCMs.
> >
> > To the point of AIC varying across the 100 trees, this is to be
> expected.  Any 1 tree of 100 trees from the credible set is not very
> meaningful; the entire 100 trees need to be analyzed and then the estimate
> +/- SE from each tree can be summarized as a distribution of values.  If
> the 95% CI on the distribution of values excludes your hypothesis, then
> you’ve learned something; if not, you accept the null hypothesis.  See the
> animated gifs here (http://vertlife.org/data/mammals/ <
> https://urldefense.com/v3/__http://vertlife.org/data/mammals/__;!!IKRxdwAv5BmarQ!NHbBAhu7TSMQWrZivqX9-FnG8bIhVy_zP2y-3oDjm_A6NttbEbrXO3uQoJdB22hcow$>)
> for a better conception of why this phylogenetic uncertainty is important
> to consider when doing model fitting or other PCMs.
> >
> > That said, if a single ‘best tree’ is the target, then the DNA-only MCC
> tree of 4098 species is a reasonable thing to analyze, more analogous to
> how 

Re: [R-sig-phylo] Model Selection and PGLS

2021-06-29 Thread David Bapst
Hi all, Russell, Nate,

I took an interest with some of the commentary here, so here's my two cents.

> I was surprised that AIC would vary this much in a dataset where the trait 
> data, number of tips, and branching
> topology used to compute the model are more or less constant between trees.

I am not surprised to hear of substantial variation in model support.
In my experience, model support can vary wildly among trees that share
the same taxa and the overall structure (they'd all more or less look
like the same tree if printed on a conference poster without
tip-labels). Most of the information is constrained by where the short
edges are in the tree, especially short terminal edges that connect to
tips. Varying which terminal edges are very short, or varying their
placement can produce a lot of variation in which model is best
supported by any information criterion.

All of this especially applies when we consider the OU (single optima)
model among our set of models, which is like a great big sponge that
likes to eat noise, but also applies just when considering between BM
or a signal-less white noise (OLS) model, or when evaluating really
any test that tries to ascertain if there is phylogenetic signal or
not.

And this all follows naturally. Obviously, if we are uncertain about
the timing of recent evolutionary divergences, we are uncertain how
rapid evolution has been recently. If recent change is very rapid,
then it will look like rates are accelerating (ala OU model) and/or
there is very poor phylogenetic signal (because old divergences in
trait-space do not correspond to the larger differences in the
dataset).

> ...given that it is logical to expect AIC to vary based on differences 
> between trees, how would one go about
> determining which regression model is the "optimal" one to use for further 
> analysis..?

I think this is a place where an unguided examination of the traits in
question cannot go any further. Frankly, you need to interrogate what
your expectation is: do you think these traits have phylogenetic
signal? Do they not? Looking at the trees, do you have any sense in
what edges, and what edge lengths are controlling the variation you
are seeing in model support?

I would not state it in the same null-model terms that Nate uses,
because referring to the PGLS (BM) as the null model means that in
some ways we are making implicit assumptions about what our default
assumptions are. I think it would be better to more carefully consider
what our expectations for this particular system is. Do you think
descendant populations are generally like their ancestors over
evolutionary time? Do you have some reason for thinking otherwise?

I would not give the OU model much consideration, unless you wish to
use it as a description of a phylogenetic process with decreased
phylogenetic signal (as Carl Boettiger once advocated in this eight
year old blog post that I still tell people to read:
https://www.carlboettiger.info/2013/10/11/is-it-time-to-retire-pagels-lambda.html).

Ultimately, though, it sounds like what you will find is a mixed
answer, so why not just take a model that allows phylogenetic signal
to scale from non-existent to very strong (like the lambda PGLS model
or the OU PGLS model), fit it to each tree in a large set, and then
you can look at how other model parameters vary with the apparent
strength of phylogenetic signal? Perhaps this may be more informative
to you than choosing a model based on a wildly varying information
criterion, or getting a 'yes' or 'no' to whether phylogenetic signal
exists in your data.

Cheers,
-Dave

On Tue, Jun 29, 2021 at 2:05 PM Nathan Upham  wrote:
>
> Hi Russell and all, sounds good.
>
> I’d suggest that the “null model” for fitting trait data to a phylogeny 
> should be single-rate Brownian motion, i.e., you’re assuming that given data 
> on the ancestor-to-descendant relationships of the species (and timing of 
> divergences), and assuming the trait is heritable, it is evolving at the same 
> random rate along each branch.  The burden of proof is on rejecting that null 
> hypothesis (not “accepting it”— sorry for earlier writing that incorrectly!). 
>  So if you do your AIC fitting across the 100 trees, summarize the results, 
> and find no clear signal of a model being obviously better than single-rate 
> Brownian, then that is what you should use for subsequent analyses.
>
> If anyone has a different perspective on this, please chime in.  The above 
> assumes that you’ve established heritability of the trait, for example by 
> doing a test for phylogenetic ’signal’.
>
> Does that help?  All the best
> —nate
>
>
>
> > On Jun 28, 2021, at 1:25 PM, Russell Engelman  
> > wrote:
> >
> > Dear Dr. Upham (and All),
> >
> > Please don't take my initial message the wrong way, this is not meant to be 
> > a dig at your 2019 study. I don’t think this is due to the birth-death tree 
> > specifically but would be present in any study where there are multiple 

Re: [R-sig-phylo] Model Selection and PGLS

2021-06-29 Thread Nathan Upham
Hi Russell and all, sounds good.

I’d suggest that the “null model” for fitting trait data to a phylogeny should 
be single-rate Brownian motion, i.e., you’re assuming that given data on the 
ancestor-to-descendant relationships of the species (and timing of 
divergences), and assuming the trait is heritable, it is evolving at the same 
random rate along each branch.  The burden of proof is on rejecting that null 
hypothesis (not “accepting it”— sorry for earlier writing that incorrectly!).  
So if you do your AIC fitting across the 100 trees, summarize the results, and 
find no clear signal of a model being obviously better than single-rate 
Brownian, then that is what you should use for subsequent analyses.

If anyone has a different perspective on this, please chime in.  The above 
assumes that you’ve established heritability of the trait, for example by doing 
a test for phylogenetic ’signal’.

Does that help?  All the best
—nate



> On Jun 28, 2021, at 1:25 PM, Russell Engelman  
> wrote:
> 
> Dear Dr. Upham (and All),
> 
> Please don't take my initial message the wrong way, this is not meant to be a 
> dig at your 2019 study. I don’t think this is due to the birth-death tree 
> specifically but would be present in any study where there are multiple 
> phylogenetic trees to choose from or some measure of uncertainty in the tip 
> dates. I definitely agree with you that there is almost certainly going to be 
> variation in model support values if there is any difference in the 
> underlying phylogeny, however, I was surprised that AIC would vary this much 
> in a dataset where the trait data, number of tips, and branching topology 
> used to compute the model are more or less constant between trees.
> 
> My question is more along the lines of "given that it is logical to expect 
> AIC to vary based on differences between trees, how would one go about 
> determining which regression model is the "optimal" one to use for further 
> analysis"? You mentioned taking the 95% confidence intervals of the models 
> and seeing if they don't overlap, would this be just taking the singular AIC 
> from the OLS model and comparing it to the PGLS one, since OLS seemingly 
> doesn't produce a confidence interval of AIC values? And if the confidence 
> intervals do overlap, is the OLS  or PGLS considered the null hypothesis? In 
> my case the AIC for OLS is within the 95% confidence intervals for the PGLS, 
> but is much lower than the mean value (it's close to the lower first standard 
> deviation of the AIC values).
> 
> Sincerely,
> Russell
> 
> On Mon, Jun 28, 2021 at 2:46 PM Nathan Upham  > wrote:
> Hi Russell and all:
> 
> I’ll respond here since the answer is related to the intended purpose of the 
> VertLife mammal trees — i.e, capturing full uncertainty in node ages and 
> phylogenetic relationships was one of the motivators for building the mammal 
> trees in the way we did.  This approach contrasts to wanting to obtain the 
> single “best tree”, since methods of phylogenetic reconstruction will always 
> just be approximations of the “true tree” anyway rather than ever being equal 
> to that tree.  To only use a single consensus tree in comparative 
> phylogenetic analyses assumes that we know the true tree, which again, we 
> don’t ever in an empirical context (only for simulations).  Those points were 
> summarized well by Huelsenbeck et al. (2000: 
> http://science.sciencemag.org/content/288/5475/2349 
> ),
>  but nevertheless are still not standard practice in PCMs.
> 
> To the point of AIC varying across the 100 trees, this is to be expected.  
> Any 1 tree of 100 trees from the credible set is not very meaningful; the 
> entire 100 trees need to be analyzed and then the estimate +/- SE from each 
> tree can be summarized as a distribution of values.  If the 95% CI on the 
> distribution of values excludes your hypothesis, then you’ve learned 
> something; if not, you accept the null hypothesis.  See the animated gifs 
> here (http://vertlife.org/data/mammals/ 
> )
>  for a better conception of why this phylogenetic uncertainty is important to 
> consider when doing model fitting or other PCMs.
> 
> That said, if a single ‘best tree’ is the target, then the DNA-only MCC tree 
> of 4098 species is a reasonable thing to analyze, more analogous to how 
> mainstream phylogenetics has presented trees for re-use 
> (https://github.com/n8upham/MamPhy_v1/blob/master/_DATA/MamPhy_fullPosterior_BDvr_DNAonly_4098sp_topoFree_NDexp_MCC_v2_target.tre
>  
> 

Re: [R-sig-phylo] Model Selection and PGLS

2021-06-28 Thread Russell Engelman
Dear Dr. Upham (and All),

Please don't take my initial message the wrong way, this is not meant to be
a dig at your 2019 study. I don’t think this is due to the birth-death tree
specifically but would be present in any study where there are multiple
phylogenetic trees to choose from or some measure of uncertainty in the tip
dates. I definitely agree with you that there is almost certainly going to
be variation in model support values if there is any difference in the
underlying phylogeny, however, I was surprised that AIC would vary this
much in a dataset where the trait data, number of tips, and branching
topology used to compute the model are more or less constant between trees.

My question is more along the lines of "given that it is logical to expect
AIC to vary based on differences between trees, how would one go about
determining which regression model is the "optimal" one to use for further
analysis"? You mentioned taking the 95% confidence intervals of the models
and seeing if they don't overlap, would this be just taking the singular
AIC from the OLS model and comparing it to the PGLS one, since OLS
seemingly doesn't produce a confidence interval of AIC values? And if the
confidence intervals do overlap, is the OLS  or PGLS considered the null
hypothesis? In my case the AIC for OLS is within the 95% confidence
intervals for the PGLS, but is much lower than the mean value (it's close
to the lower first standard deviation of the AIC values).

Sincerely,
Russell

On Mon, Jun 28, 2021 at 2:46 PM Nathan Upham  wrote:

> Hi Russell and all:
>
> I’ll respond here since the answer is related to the intended purpose of
> the VertLife mammal trees — i.e, capturing full uncertainty in node ages
> and phylogenetic relationships was one of the motivators for building the
> mammal trees in the way we did.  This approach contrasts to wanting to
> obtain the single “best tree”, since methods of phylogenetic reconstruction
> will always just be approximations of the “true tree” anyway rather than
> ever being equal to that tree.  To only use a single consensus tree in
> comparative phylogenetic analyses assumes that we know the true tree, which
> again, we don’t ever in an empirical context (only for simulations).  Those
> points were summarized well by Huelsenbeck et al. (2000:
> http://science.sciencemag.org/content/288/5475/2349), but nevertheless
> are still not standard practice in PCMs.
>
> To the point of AIC varying across the 100 trees, this is to be expected.
> Any 1 tree of 100 trees from the credible set is not very meaningful; the
> entire 100 trees need to be analyzed and then the estimate +/- SE from each
> tree can be summarized as a distribution of values.  If the 95% CI on the
> distribution of values excludes your hypothesis, then you’ve learned
> something; if not, you accept the null hypothesis.  See the animated gifs
> here (http://vertlife.org/data/mammals/) for a better conception of why
> this phylogenetic uncertainty is important to consider when doing model
> fitting or other PCMs.
>
> That said, if a single ‘best tree’ is the target, then the DNA-only MCC
> tree of 4098 species is a reasonable thing to analyze, more analogous to
> how mainstream phylogenetics has presented trees for re-use (
> https://github.com/n8upham/MamPhy_v1/blob/master/_DATA/MamPhy_fullPosterior_BDvr_DNAonly_4098sp_topoFree_NDexp_MCC_v2_target.tre).
> But again, while the MCC tree is appropriate, 1 of 100 trees from the
> credible set is not.
>
> Hope that helps.  All the best,
> —nate
>
>
>
>
> ==
> Nathan S. Upham, Ph.D. (he/him)
> Assistant Research Professor & Associate Curator of Mammals
> Arizona State University, School of Life Sciences, Biodiversity Knowledge
> Integration Center (BioKIC )
>  ~> Check out the new Mammal Tree of Life
>  and the Mammal Diversity Database
> 
>
> Research Associate, Yale University (Ecology and Evolutionary Biology)
> Research Associate, Field Museum of Natural History (Negaunee Integrative
> Research Center)
> Chair, Biodiversity Committee, American Society of Mammalogists
> Taxonomy Advisor, IUCN/SSC Small Mammal Specialist Group
>
> personal web: n8u.org | Google Scholar
> 
>  | ASU profile 
> e: nathan.up...@asu.edu | Skype: nate_upham | Twitter: @n8_upham
> 
>
> =
>
>
>
> On Jun 28, 2021, at 10:47 AM, Russell Engelman 
> wrote:
>
> Dear R-Sig-Phylo Mailing List,
>
> I ran into a rather unusual problem. I was doing an analysis using the
> mammal trees from Upham et al. 

Re: [R-sig-phylo] Model Selection and PGLS

2021-06-28 Thread Nathan Upham
Hi Russell and all:

I’ll respond here since the answer is related to the intended purpose of the 
VertLife mammal trees — i.e, capturing full uncertainty in node ages and 
phylogenetic relationships was one of the motivators for building the mammal 
trees in the way we did.  This approach contrasts to wanting to obtain the 
single “best tree”, since methods of phylogenetic reconstruction will always 
just be approximations of the “true tree” anyway rather than ever being equal 
to that tree.  To only use a single consensus tree in comparative phylogenetic 
analyses assumes that we know the true tree, which again, we don’t ever in an 
empirical context (only for simulations).  Those points were summarized well by 
Huelsenbeck et al. (2000: http://science.sciencemag.org/content/288/5475/2349), 
but nevertheless are still not standard practice in PCMs.

To the point of AIC varying across the 100 trees, this is to be expected.  Any 
1 tree of 100 trees from the credible set is not very meaningful; the entire 
100 trees need to be analyzed and then the estimate +/- SE from each tree can 
be summarized as a distribution of values.  If the 95% CI on the distribution 
of values excludes your hypothesis, then you’ve learned something; if not, you 
accept the null hypothesis.  See the animated gifs here 
(http://vertlife.org/data/mammals/) for a better conception of why this 
phylogenetic uncertainty is important to consider when doing model fitting or 
other PCMs.

That said, if a single ‘best tree’ is the target, then the DNA-only MCC tree of 
4098 species is a reasonable thing to analyze, more analogous to how mainstream 
phylogenetics has presented trees for re-use 
(https://github.com/n8upham/MamPhy_v1/blob/master/_DATA/MamPhy_fullPosterior_BDvr_DNAonly_4098sp_topoFree_NDexp_MCC_v2_target.tre).
  But again, while the MCC tree is appropriate, 1 of 100 trees from the 
credible set is not.

Hope that helps.  All the best,
—nate



==
Nathan S. Upham, Ph.D. (he/him)
Assistant Research Professor & Associate Curator of Mammals
Arizona State University, School of Life Sciences, Biodiversity Knowledge 
Integration Center (BioKIC )
 ~> Check out the new Mammal Tree of Life 
 and the Mammal Diversity Database 


Research Associate, Yale University (Ecology and Evolutionary Biology)
Research Associate, Field Museum of Natural History (Negaunee Integrative 
Research Center)
Chair, Biodiversity Committee, American Society of Mammalogists
Taxonomy Advisor, IUCN/SSC Small Mammal Specialist Group

personal web: n8u.org | Google Scholar 

 | ASU profile 
e: nathan.up...@asu.edu | Skype: nate_upham | Twitter: @n8_upham 
 
=



> On Jun 28, 2021, at 10:47 AM, Russell Engelman  
> wrote:
> 
> Dear R-Sig-Phylo Mailing List,
> 
> I ran into a rather unusual problem. I was doing an analysis using the
> mammal trees from Upham et al. (2019) downloaded off of the VertLife site.
> The model statistics for my data initially suggested that the OLS model was
> better supported than a PGLS model based on Akaike Information Criterion
> (AIC). The reviewers for the paper wanted me to add more taxa, so I
> re-downloaded a set of trees from VertLife and reran the analysis, but when
> I did I found that suddenly the AIC values for the PGLS equation were
> dramatically different, to the point that it favored a Brownian PGLS model
> over all other models. This was despite the fact that previously I found
> that an OLS model and an OU model had a better model fit than a Brownian
> model, and the other accuracy statistics of interest (like percent error,
> this being a model intended for use in predicting new data) also found OLS
> and OU models to fit better than a Brownian PGLS model. The regression line
> for a Brownian model doesn't even fit the data at all due to being biased
> by a basal clade. The model also has a high amount of phylogenetic inertia
> which again would seemingly make an OU model a better option.
> 
> I used drop.tip to remove the additional taxa to see if I could replicate
> my previous results, but it turns out I still couldn't replicate the
> results. That's when I realized what was causing the change in AIC values
> wasn't the taxon selection, but the tree I was using. If I used the old
> VertLife tree I could replicate the results, but the new VertLife tree
> produced radically different results despite using the same tips. So what I
> decided to do is rerun the analysis for all 100 trees I had available, and
> it turned 

[R-sig-phylo] Model Selection and PGLS

2021-06-28 Thread Russell Engelman
Dear R-Sig-Phylo Mailing List,

I ran into a rather unusual problem. I was doing an analysis using the
mammal trees from Upham et al. (2019) downloaded off of the VertLife site.
The model statistics for my data initially suggested that the OLS model was
better supported than a PGLS model based on Akaike Information Criterion
(AIC). The reviewers for the paper wanted me to add more taxa, so I
re-downloaded a set of trees from VertLife and reran the analysis, but when
I did I found that suddenly the AIC values for the PGLS equation were
dramatically different, to the point that it favored a Brownian PGLS model
over all other models. This was despite the fact that previously I found
that an OLS model and an OU model had a better model fit than a Brownian
model, and the other accuracy statistics of interest (like percent error,
this being a model intended for use in predicting new data) also found OLS
and OU models to fit better than a Brownian PGLS model. The regression line
for a Brownian model doesn't even fit the data at all due to being biased
by a basal clade. The model also has a high amount of phylogenetic inertia
which again would seemingly make an OU model a better option.

I used drop.tip to remove the additional taxa to see if I could replicate
my previous results, but it turns out I still couldn't replicate the
results. That's when I realized what was causing the change in AIC values
wasn't the taxon selection, but the tree I was using. If I used the old
VertLife tree I could replicate the results, but the new VertLife tree
produced radically different results despite using the same tips. So what I
decided to do is rerun the analysis for all 100 trees I had available, and
it turned out there was a massive amount of variation in AIC depending on
what tree was chosen. I tried including an html data printout to show the
precise results and how I got them, but I couldn't attach them because the
mailer daemon kept saying they were too large. The AIC values between trees
vary by almost 200 points after excluding extreme outliers, when model
differences of 2 or more are often considered to represent statistically
detectable differences. The unusually low AIC I got when I first ran the
analysis happened to be because the first tree in the 100 trees merely
happened to produce a lower-than-average AIC than the whole sample. The
average AIC out of the 100 trees was higher than for the OLS model, which
again makes sense given the distribution of the data.

However, and this is where my problem comes in, how do I make appropriate
model selections for PGLS if there is such a massive amount of variation in
AIC? Especially given that between the trees in the sample there is enough
variation that it can cause one model to be favored over another? Just
picking one tree and going with that seems counterintuitive, because it's
not very objective and theoretically someone could pick a specific tree to
get the results they want, or accidentally pick a tree that might support
the wrong model as seen here. On top of that the tree topologies are more
or less identical: the same 404 taxa are present in all trees and the trees
have nearly identical topologies, the only real differences between trees
are branch lengths. But given this, how can I justify which AIC value I
report, which in turn means which model is best supported?

I did try looking at the phylo_lm function in the sensiphy package, but
that function doesn't seem to provide any method of performing model
selection between different regression models. It does seemingly report
AIC, but the AIC the function reported was dramatically different from the
aic I got using the gls function in ape and nlme.

Sincerely,
Russell

[[alternative HTML version deleted]]

___
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/