Dear Russell,

I was not talking about the OLS residuals (OLS is indeed expected to behave better than GLS when there are errors in the variables; see the reference I cited last time) but about the residuals of your GLS fit, since that is the fit that apparently has a suspect slope.
Note also that if there are trends driven by two clades in your residuals, this is likely a model design issue (e.g., a single regression line is not sufficient to model your data; in your plot, for instance, there seems to be a different relationship within Cetacea). Using a (P)GLS instead of OLS will not solve the problem: both approaches offer an unbiased estimate of the slope, provided there are no observation errors. When there are, you can try to correct the bias using, for instance, the reliability ratio.

Yes, the sampling error can have mixed sources (some can be “biological”). If you only have information about the variance for some species, maybe you can still approximate the value for the others by using a pooled estimate?

Best wishes,
Julien

From: Russell Engelman <neovenatori...@gmail.com>
Sent: Friday, 22 October 2021 00:54
To: Julien Clavel <julien.cla...@hotmail.fr>; mailman, r-sig-phylo <r-sig-phylo@r-project.org>
Subject: Re: [R-sig-phylo] Irregularity in PGLS Slope Driven By Scope of Taxon Selection

Dear Dr. Clavel,

> If you plot the residuals against your predictor they will likely be correlated in this case.

I'm not sure this is the case. For the OLS fit, the residuals-versus-fitted plot is mostly linear and suggestive of normality. There is some non-random distribution of the residuals, but this is driven by two clades that end up biasing the fit, which is part of the reason I am trying to see whether PGLS methods produce more reasonable results. The scale-location plot suggests that the variance of the residuals increases with size, but this also appears to be driven by the same two clades that were biasing the fit under OLS and that overall show a reduced correlation between brain and body size. Thus the heteroskedasticity in this plot is driven by biological variation rather than measurement error. Excluding these two groups produces a scale-location plot where the log residuals are homoskedastic.
> I would guess that there is likely as much or less uncertainty in the estimate of brain size as in that of body size across mammals if both were independently estimated.

This seems to be what Pagel and Harvey (1988) were suggesting: that error variation in body size was somehow driving shallower slopes among mammals (within-genus regressions had the shallowest slopes, followed by within-family, then within-order). However, it wasn't quite clear what they meant by sampling error (e.g., the imprecision in the actual measurement, or the intraspecific variation in body mass due to body condition). It sounds reasonable to me that this is probably the case.

> Assuming you can obtain an estimate for this error, it’s usually possible to correct this bias. An alternative is to include another “instrumental variable” as covariate.

How would one go about doing this? I ask because most of these data come from prior literature sources, and standard deviations of the variables are often not reported. Some of the data come from single individuals due to the limited availability of specimens in the parent study(/ies). I saw that Hansen & Bartoszek 2012 mention a "reliability ratio" that they used to correct the data, but I'm not exactly sure if this is the same thing.

Sincerely,
Russell

On Wed, Oct 20, 2021 at 10:21 AM Julien Clavel <julien.cla...@hotmail.fr> wrote:

Hi Russell,

Just a hint, but this type of bias (assuming there are no formatting issues with the data) often shows up when there are considerable (non-random) errors in the predictors (we talk about "errors-in-variables models"). If you plot the residuals against your predictor they will likely be correlated in this case. I would guess that there is likely as much or less uncertainty in the estimate of brain size as in that of body size across mammals if both were independently estimated.
You can see for instance Morton-Jones & Henderson 2000 (Technometrics) for GLS in general, and Hansen & Bartoszek 2012 (Systematic Biology) for the (P)GLS case. Assuming you can obtain an estimate for this error, it’s usually possible to correct this bias. An alternative is to include another “instrumental variable” as covariate.

Best wishes,
Julien

From: R-sig-phylo <r-sig-phylo-boun...@r-project.org> on behalf of Russell Engelman <neovenatori...@gmail.com>
Sent: Wednesday, 20 October 2021 04:29
To: mailman, r-sig-phylo <r-sig-phylo@r-project.org>
Subject: [R-sig-phylo] Irregularity in PGLS Slope Driven By Scope of Taxon Selection

Dear R-Sig-Phylo,

I'm having a very strange issue with PGLS in R and I was wondering if anyone had seen this before. I've been doing some work with brain size in mammals, using the dataset of Burger et al. 2019 as a base. The data here come from Burger et al. 2019, but the same thing happens with my own data. I have been trying to calculate a PGLS fit, based on the suggestions of some previous authors that the best-fit line is biased by delphinoids and anthropoid primates. However, the best-fit line I get does not follow the data at all, whether this line is calculated for all rodents or for all mammals.

At first I thought maybe the PGLS best-fit line was simply very different when phylogenetic covariance is minimized, but then I found out this wasn't the case at all. Many other studies, such as Boddy et al. (2012), used PGLS and got slopes that looked reasonable. E.g., Boddy et al. (2012) got slopes of log-brain size on log-body size of 0.63-0.68, which makes sense given the distribution of the data, whereas the dataset I have here gives a slope of 0.51, which completely bypasses the linear distribution of the data. Notably, the data I have here isn't distributed in a way that suggests the OLS fit is driven by

Here's where it gets even stranger.
On a suggestion from my co-author, I performed a PGLS fit using only the median species from each family, so that no one clade would have a huge influence on the regression and the PGLS would be making comparisons between higher-level clades. The best-fit line for the family-level regression had a much higher slope than the one obtained by treating each species individually, such that the PGLS line was pretty close to the OLS line.

I have no idea why this is occurring. I can't figure out why the PGLS function is consistently producing a line that does not follow the distribution of the data at all, even when the data are subsetted to more restricted taxonomic intervals. It is especially unclear why reducing the dataset to "one species per family" results in a dramatically higher slope. The closest thing I can think of is the issue noted by Pagel and Harvey 1988, who reported that restricting taxonomic scope resulted in increasingly lower slopes, for a mathematical reason that wasn't clear to me when I read the paper. What I'm wondering is this: if the slope of the regression tends to be flatter at very narrow taxonomic intervals (e.g., within-genus), and if most of the comparisons in the covariance matrix of a species-level regression are between closely related taxa, will that pull the PGLS model toward a lower slope?

I did also try an OU model, but the OU model also gave suspicious results. Specifically, it gave results that were nearly identical to the OLS, when there is good reason to believe the OLS slope is biased by the presence of large-brained cetaceans and primates. Previous studies found the PGLS slope to be much lower than the OLS slope because of this, and with the data here, excluding these taxa likewise results in a lower slope.
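The mechanism raised in this thread (error in the predictor flattening the slope, and more strongly at narrow taxonomic scope) can be illustrated with a small simulation. What follows is only a minimal sketch under stated assumptions, not anyone's actual analysis: it uses plain OLS rather than PGLS, the error variance in the predictor is assumed to be known, and the function names (`ols_slope`, `simulate`) and all numeric values are hypothetical. For the phylogenetic case, see Hansen & Bartoszek 2012 as cited above. The reliability ratio here is K = var(true x) / (var(true x) + var(error)); the naive slope is expected to be about K times the true slope, so dividing by an estimate of K is one way to correct it.

```python
import random


def variance(values):
    """Sample variance (n - 1 denominator)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


def simulate(true_sd_x, error_sd_x, true_slope=0.75, n=20000, seed=1):
    """Return (naive slope, estimated reliability ratio, corrected slope).

    All values are hypothetical: true_sd_x stands in for the spread of
    true log body mass in the sample (wide vs. narrow taxonomic scope),
    and error_sd_x for the error in observed log body mass, which is
    assumed to be known.
    """
    rng = random.Random(seed)
    x_true = [rng.gauss(0.0, true_sd_x) for _ in range(n)]
    y = [true_slope * xt + rng.gauss(0.0, 0.2) for xt in x_true]
    # The predictor is observed with additive, independent error.
    x_obs = [xt + rng.gauss(0.0, error_sd_x) for xt in x_true]

    naive = ols_slope(x_obs, y)
    var_obs = variance(x_obs)
    # Reliability ratio estimated from the observed variance and the
    # (assumed known) error variance.
    reliability = (var_obs - error_sd_x ** 2) / var_obs
    return naive, reliability, naive / reliability


# Wide scope: true variation in x is large relative to the error.
wide = simulate(true_sd_x=2.0, error_sd_x=0.3)
# Narrow scope: true variation in x is as small as the error itself.
narrow = simulate(true_sd_x=0.3, error_sd_x=0.3)

for label, (naive, k, corrected) in (("wide", wide), ("narrow", narrow)):
    print(f"{label:6s} naive={naive:.3f}  K={k:.3f}  corrected={corrected:.3f}")
```

With these made-up settings the theoretical reliability ratio is about 0.98 for the wide sample and 0.5 for the narrow one, so the same amount of error in the predictor flattens the narrow-scope slope far more. When the error variance is only available for some species, a pooled estimate (as suggested earlier in the thread) would take the place of `error_sd_x ** 2`.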
Sincerely,
Russell

_______________________________________________
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/