Simone, thanks for sending that! As a note of caution for users, make sure you compute PREDICTION INTERVALS, not confidence intervals. The former are generally much wider. Cheers, Ted
On Fri, Jul 2, 2021 at 1:55 AM Simone Blomberg <s.blombe...@uq.edu.au> wrote: > Hi all, > > Here is some example R code that implements the formula in Ted and > Tony's 2000 paper. I haven't tested it thoroughly, so caveat emptor. I > did check that there is no bias due to phylogenetic effects if you use a > star phylogeny, which gives me some comfort. I haven't bothered with > computing the prediction intervals, but it is pretty straight forward I > think. > > Best, > Simone. > > library(ape) > library(nlme) > library(MASS) > set.seed(182347) > tree <- compute.brlen(rtree(50)) > x <- rnorm(50) > y <- 3*x+mvrnorm(mu=rep(0, 50), Sigma=vcv.phylo(tree)) ## add physig to y > y[sample(length(y), 4)] <- NA ## create some missing values in y to predict > dat <- data.frame(x=x, y=y, species=tree$tip.label) > > tree.trimmed <- drop.tip(tree, which(is.na(dat$y))) ## say we first do > the analysis on the non-missing cases > dat.trimmed <- dat[complete.cases(dat),] > fit <- gls(y~x, correlation=corPagel(0.5, phy=tree.trimmed, form=~species), > data=dat.trimmed) > lambda <- as.numeric(fit$modelStruct) ## will assume the same lambda > applies to missing and nonmissing cases. I'm not sure this is reasonable? > > ## lambda > ## [1] 1.003751 > > model.matFULL <- model.matrix(~x, data=dat) ## design matrix for model > > missvals <- which(is.na(dat$y)) > nmiss <- length(missvals) > preds <- predict(fit, newdata=dat[missvals,]) ## get predictions from > model, assuming species are at the root. > mat <- vcv.phylo(tree) > C <- 1+(lambda-1)*(1-mat) ## Lambda correction > > CCompleteData <- C[-missvals, -missvals] ## remove the missing species > cihmat <- C[missvals, -missvals] ## get the relationships of the "new" > species to the previous ones. > > XnoMean <- scale(model.matFULL[-missvals,], scale=FALSE) ## centre the > model matrix > > mu <- cihmat %*% solve(CCompleteData) %*% XnoMean ## formula from Ted > and Tony's 2000 paper, appendix B, last paragraph. > > preds > ## [1] -2.440567 -2.370682 -1.035975 -2.583952 > > mu > ## (Intercept) x > ## t26 0 -0.2208723 > ## t27 0 0.5918446 > ## t43 0 -1.1822058 > ## t39 0 -0.8706877 > > preds+mu[,2] ## "correct" prediction accounting for bias due to > correlations with other species. > > ## t26 t27 t43 t39 > ## -2.661439 -1.778837 -2.218181 -3.454639 > > On 1/7/21 3:09 pm, Theodore Garland wrote: > > Russell, > > Please read this paper: > > https://pubmed.ncbi.nlm.nih.gov/10718731/ > > Cheers > > Ted > > > > > > On Wed, Jun 30, 2021, 9:21 PM Russell Engelman <neovenatori...@gmail.com > > > > wrote: > > > >> Dear All, > >> > >> What you see is the large uncertainty in “ancestral” states, which is > part > >>> of the intercept here. The linear relationship that you overlaid on > top of > >>> your data is the relationship predicted at the root of the tree (as if > such > >>> a thing existed!). There is a lot of uncertainty about the intercept, > but > >>> much less uncertainty in the slope. It looks like the slope is not > affected > >>> by the inclusion or exclusion of monotremes. (for one possible > reference on > >>> the greater precision in the slope versus the intercept, there’s this: > >>> http://dx.doi.org/10.1214/13-AOS1105 for the BM). > >> > >> Yes, that sounds right from the other data I have. The line approximates > >> what would be expected for the root of Mammalia, and the signal in the > PGLS > >> is more due to shifts in the y-intercept than shifts in slope, which in > >> turn is supported by the anatomy of the proxy. > >> > >> My second cent is that the phylogenetic predictions should be stable. > The > >>> uncertainty in the intercept —and the large effect of including > monotremes > >>> on the intercept— should not affect predictions, so long as you know > for > >>> which species you want to make a prediction. If you want to make > prediction > >>> for a species in a small clade “far” from monotremes, say, then the > >>> prediction is probably quite stable, even if you include monotremes: > this > >>> is because the phylogenetic prediction should use the phylogenetic > >>> relationships for the species to be predicted. A prediction that uses > the > >>> linear relationship at the root and ignores the placement of the > species > >>> would be the worst-case scenario: for a mammal species with a > completely > >>> unknown placement within mammals. > >> > >> This is what I'm a bit confused about. I was always told (and it > seemingly > >> implies this in some of the PGLS literature I read like Rohlf 2011 and > >> Smaers and Rohlf 2016) that it isn't possible to include phylogenetic > data > >> from the new data points into the prediction in order to improve > >> predictions. I'm a little confused as to whether it's possible or not > (see > >> below). > >> > >> There’s probably a number of software that do phylogenetic prediction. I > >>> know of Rphylopars and PhyloNetworks. > >> > >> I will take a look into those. > >> > >> I think that Cécile' and Theodore' point is important and too often > >>> overlooked. Using GLS models, the BLUP (Best Linear Unbiased > Prediction) is > >>> not simply obtained from the fitted line but should incorporates > >>> information from the (evolutionary here) model. > >> > >> There’s a way to impute phylogenetic signal back into a PGLS > model? I > >> am super surprised at that. I’ve talked to at least three different > >> colleagues who use PGLS about this issue, and all of them had told me > that > >> there is no way to input phylogenetic signal back into the model for new > >> data points and I should just go with the single regression line the > model > >> gives me (i.e., the regression line for the ancestral node). > >> > >> I tried looking around to see what previous researchers used when > >> using PCM on body mass (Esteban-Trivigno and Köhler 2011, Campione and > >> Evans 2012, Yapuncich 2017 thesis) and it looks like all of them just > went > >> with the best fit line with the ancestral node, i.e., looking at their > >> reported results they give a simple trait~predictor equation that does > not > >> include phylogeny when calculating new data. Campion and Evans 2012 used > >> PIC versus PGLS, which I know are technically equivalent but it doesn't > >> seem like they included phylogenetic information when they predicted new > >> data: they used their equations on dinosaurs but there are no dinosaurs > in > >> the tree they used. I know that it’s possible to incorporate > phylogenetic > >> signal into the new data using PVR but PVR has been criticized for other > >> reasons. > >> > >> This is something that seems really, really concerning because if > >> there is a method of using phylogenetic covariance to adjust the > position > >> of new data points it seems like a lot of workers don’t know these > methods > >> exist, to the point that even published papers overlook it. This was > >> something I was hoping to highlight in a later paper on the data, but it > >> sounds like people might have discussed it already. I remember talking > with > >> my colleagues a lot about "isn't there some way to incorporate > phylogenetic > >> information back into the model to improve accuracy of the prediction > if we > >> know where the taxon is positioned?" and they just thought there wasn't > a > >> way. > >> > >> Regarding the model comparison, I would simply avoid it (or limit it) by > >>> fitting models flexible enough to accommodate between your BM and OLS > case > >>> and summarize the results obtained across all the trees… > >> > >> I am not entirely sure what is meant here. Do you mean fitting both an > OLS > >> and BM model and comparing both models? I am reporting both, but my > concern > >> is about which model I report is the best one to use going forward, > since > >> the BM model is seemingly less accurate (though I am just taking the > fitted > >> values from the PGLS model, which I don't think include phylogenetic > >> information). The two models I use produce dramatically different > results, > >> for example the BM model produces body mass estimates which are 25% > larger > >> than OLS. > >> > >> Right now PGLS is something I would avoid if I had the option (if for no > >> other reason than not put all of the analyses in a single, overloaded > >> manuscript [the manuscript is already about 90 pages] and deviate from > the > >> scope of the study), but I'm sure you know that most regression analyses > >> nowadays require some sort of preliminary PCM to be acceptable. > >> > >> Sincerely, > >> Russell > >> > >> On Wed, Jun 30, 2021 at 10:24 AM Julien Clavel < > julien.cla...@hotmail.fr> > >> wrote: > >> > >>> I think that Cécile' and Theodore' point is important and too often > >>> overlooked. Using GLS models, the BLUP (Best Linear Unbiased > Prediction) is > >>> not simply obtained from the fitted line but should incorporates > >>> information from the (evolutionary here) model. > >>> > >>> For multivariate linear model you can also do it by specifying a tree > >>> including both the species used to build the model and the ones you > want to > >>> predict using the “predict” function in mvMORPH (I think that > Rphylopars > >>> can deal with multivariate phylogenetic regression too). > >>> > >>> Regarding the model comparison, I would simply avoid it (or limit it) > by > >>> fitting models flexible enough to accommodate between your BM and OLS > case > >>> and summarize the results obtained across all the trees… > >>> > >>> Julien > >>> > >>> > >>> De : R-sig-phylo <r-sig-phylo-boun...@r-project.org> de la part de > >>> Theodore Garland <theodore.garl...@ucr.edu> > >>> Envoyé : mercredi 30 juin 2021 03:26 > >>> À : Cecile Ane <cecile....@wisc.edu> > >>> Cc : mailman, r-sig-phylo <r-sig-phylo@r-project.org>; > >>> neovenatori...@gmail.com <neovenatori...@gmail.com> > >>> Objet : Re: [R-sig-phylo] Model Selection and PGLS > >>> > >>> All true. I would just add two things. First, always graph your data > and > >>> do ordinary OLS analyses as a reality check. > >>> > >>> Second, I think this is the original paper for phylogenetic prediction: > >>> Garland, Jr., T., and A. R. Ives. 2000. Using the past to predict the > >>> present: confidence intervals for regression equations in phylogenetic > >>> comparative methods. American Naturalist 155:346–364. > >>> There, we talk about the Equivalency of the Independent-Contrasts and > >>> Generalized Least Squares Approaches. > >>> > >>> Cheers, > >>> Ted > >>> > >>> > >>> On Tue, Jun 29, 2021 at 5:01 PM Cecile Ane <cecile....@wisc.edu> > wrote: > >>> > >>>> Hi Russel, > >>>> > >>>> What you see is the large uncertainty in “ancestral” states, which is > >>> part > >>>> of the intercept here. The linear relationship that you overlaid on > top > >>> of > >>>> your data is the relationship predicted at the root of the tree (as if > >>> such > >>>> a thing existed!). There is a lot of uncertainty about the intercept, > >>> but > >>>> much less uncertainty in the slope. It looks like the slope is not > >>> affected > >>>> by the inclusion or exclusion of monotremes. (for one possible > >>> reference on > >>>> the greater precision in the slope versus the intercept, there’s this: > >>>> http://dx.doi.org/10.1214/13-AOS1105 for the BM). > >>>> > >>>> My second cent is that the phylogenetic predictions should be stable. > >>> The > >>>> uncertainty in the intercept —and the large effect of including > >>> monotremes > >>>> on the intercept— should not affect predictions, so long as you know > for > >>>> which species you want to make a prediction. If you want to make > >>> prediction > >>>> for a species in a small clade “far” from monotremes, say, then the > >>>> prediction is probably quite stable, even if you include monotremes: > >>> this > >>>> is because the phylogenetic prediction should use the phylogenetic > >>>> relationships for the species to be predicted. A prediction that uses > >>> the > >>>> linear relationship at the root and ignores the placement of the > species > >>>> would be the worst-case scenario: for a mammal species with a > completely > >>>> unknown placement within mammals. > >>>> > >>>> There’s probably a number of software that do phylogenetic > prediction. I > >>>> know of Rphylopars and PhyloNetworks. > >>>> > >>>> my 2 cents… > >>>> Cecile > >>>> > >>>> --- > >>>> Cécile Ané, Professor (she/her) > >>>> H. I. Romnes Faculty Fellow > >>>> Departments of Statistics and of Botany > >>>> University of Wisconsin - Madison > >>>> www.stat.wisc.edu/~ane/<http://www.stat.wisc.edu/~ane/> > >>>> > >>>> CALS statistical consulting lab: > >>>> https://calslab.cals.wisc.edu/stat-consulting/ > >>>> > >>>> > >>>> > >>>> On Jun 29, 2021, at 5:37 PM, neovenatori...@gmail.com<mailto: > >>>> neovenatori...@gmail.com> wrote: > >>>> > >>>> Dear All, > >>>> > >>>> So this is the main problem I'm facing (see attached figure, which > >>> should > >>>> be small enough to post). When I calculate the best-fit line under a > >>>> Brownian model, this produces a best-fit line that more or less > bypasses > >>>> the distribution of the data altogether. I did some testing and found > >>> that > >>>> this result was driven solely by the presence of Monotremata, > resulting > >>> in > >>>> the model heavily downweighting all of the phylogenetic variation > within > >>>> Theria in favor of the deep divergence between Monotremata and Theria. > >>>> Excluding Monotremata produces a PGLS fit that's comparable enough to > >>> the > >>>> OLS and OU model fit to be justifiable (though I can't just throw out > >>>> Monotremata for the sake of throwing it out). > >>>> > >>>> I am planning to do a more theoretical investigation into the effect > of > >>>> Monotremata on the PGLS fit in a future study, but right now what I am > >>>> trying to do is perform a study in which I use this data to construct > a > >>>> regression model that can be used to predict new data. Which is why I > am > >>>> trying to use AIC to potentially justify going with OLS or an OU model > >>> over > >>>> a Brownian model. From a practical perspective the Brownian model is > >>> almost > >>>> unusable because it produces systematically biased estimates with high > >>>> error rates when applied to new data (error rate is roughly double > that > >>> of > >>>> both the OLS and OU model). This is especially the case because the > data > >>>> must be back-transformed into an arithmetic scale to be useable, and > >>> thus a > >>>> seemingly minor difference in regression models results in a massive > >>>> difference in predicted values. However, I need some objective test to > >>> show > >>>> that OLS fits the data better than the Brownian model, hence why I was > >>>> going with AIC. Overall, OLS does seem to outperform the Brownian > model > >>> on > >>>> average, but the variation in AIC is so high it is hard to interpret > >>> this. > >>>> This is kind of why I am leery of assuming a null Brownian model. A > >>>> Brownian model, if anything, does not seem to accurately model the > >>>> relationship between variables. > >>>> > >>>> This is why I am having trouble figuring out how to do model > selection. > >>>> Just going with accuracy statistics like percent error or standard > >>> error of > >>>> the estimate OLS is better from a purely practical sense (it doesn't > >>> work > >>>> for the monotreme taxa, but it turns out that estimate error in the > >>>> monotremes is only decreased by 10% in a Brownian model when it > >>>> overestimates mass by nearly 75%, so the improvement really isn't > worth > >>> it > >>>> and using this for monotremes isn't recommended in the first place), > but > >>>> the reviewers are expressing skepticism over the fact that the > Brownian > >>>> model produces less useable results. And I'm not entirely sure the > best > >>> way > >>>> to go about the PGLS if using one of the birth-death trees isn't > ideal, > >>>> perhaps what Dr. Upham says about using the DNA tree might work > better. > >>>> > >>>> Ironically, an OU model might be argued to better fit the data, > despite > >>>> the concerns that Dr. Bapst mentioned. Looking at the distribution of > >>>> signal even though signal is not random, it is more accurately > >>> described as > >>>> most taxa hewing to a stable equilibrium with rapid, high magnitude > >>> shifts > >>>> at certain evolutionary nodes, rather than the covariation between the > >>> two > >>>> traits evolving in a Brownian fashion. I did some experiments with a > PSR > >>>> curve and the results seem to favor an OU model or other models with > >>> uneven > >>>> rates of evolution rather than a pure Brownian model. > >>>> > >>>> Of course, the broader issue I am facing is trying to deal with PGLS > >>>> succinctly; the scope of the study isn't necessarily an in-depth > >>> comparison > >>>> between different regression models, it's more looking at how this > >>> variable > >>>> correlates with body mass for practical purposes (for which > considering > >>>> phylogeny is one part of that). It's definitely something to consider > >>> but I > >>>> am trying to avoid manuscript bloat. > >>>> > >>>> Sincerely, > >>>> Russell > >>>> > >>>> > >>>> [[alternative HTML version deleted]] > >>>> > >>>> _______________________________________________ > >>>> R-sig-phylo mailing list - R-sig-phylo@r-project.org > >>>> https://stat.ethz.ch/mailman/listinfo/r-sig-phylo > >>>> Searchable archive at > >>>> http://www.mail-archive.com/r-sig-phylo@r-project.org/ > >>>> > >>> [[alternative HTML version deleted]] > >>> > >>> _______________________________________________ > >>> R-sig-phylo mailing list - R-sig-phylo@r-project.org > >>> https://stat.ethz.ch/mailman/listinfo/r-sig-phylo > >>> Searchable archive at > >>> http://www.mail-archive.com/r-sig-phylo@r-project.org/ > >> > > [[alternative HTML version deleted]] > > > > _______________________________________________ > > R-sig-phylo mailing list - R-sig-phylo@r-project.org > > https://stat.ethz.ch/mailman/listinfo/r-sig-phylo > > Searchable archive at > http://www.mail-archive.com/r-sig-phylo@r-project.org/ > > -- > Simone Blomberg, BSc (Hons), PhD, MAppStat (she/her) > Senior Lecturer and Consultant Statistician > School of Biological Sciences > The University of Queensland > St. Lucia Queensland 4072 > Australia > T: +61 7 3365 2506 > email: S.Blomberg1_at_uq.edu.au > Twitter: @simoneb66 > UQ ALLY Supporting the diversity of sexuality and gender at UQ. > > Policies: > 1. I will NOT analyse your data for you. > 2. Your deadline is your problem. > > Basically, I'm not interested in doing research > and I never have been. I'm interested in > understanding, which is quite a different thing. > And often to understand something you have to > work it out for yourself because no one else > has done it. - David Blackwell > > [[alternative HTML version deleted]] _______________________________________________ R-sig-phylo mailing list - R-sig-phylo@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-phylo Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/