Dear Russell,

I was not talking about the OLS residuals (OLS is indeed expected to behave 
better than GLS when there are errors in variables; see the reference I cited 
last time) but about the residuals of your GLS fit, since this is the one 
that apparently has a suspect slope.

Note also that if there are trends driven by two clades in your residuals, this 
is likely a model design issue (e.g., a single regression line is not 
sufficient to model your data; it seems, for instance, that you have a 
different relationship within Cetacea in your plot). Using a (P)GLS instead of 
OLS will not solve the problem: both approaches offer an unbiased estimate of 
the slope, provided there are no observation errors. When there are, you can 
try to correct the bias using, for instance, the reliability ratio.
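
To make the reliability-ratio idea concrete, here is a minimal sketch in plain Python. It is an ordinary (non-phylogenetic) simulation, not the (P)GLS version described by Hansen & Bartoszek 2012, and every number in it (the true slope, the spread of the predictor, the observation error) is a made-up assumption chosen only for illustration. It assumes the observation-error variance of the predictor is known.

```python
import random

def ols_slope(x, y):
    """Ordinary least-squares slope of y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

def sample_variance(x):
    """Unbiased sample variance."""
    n = len(x)
    mean_x = sum(x) / n
    return sum((a - mean_x) ** 2 for a in x) / (n - 1)

random.seed(1)
n = 5000
true_slope = 0.7   # hypothetical "true" slope, not an estimate from any dataset
error_sd = 0.5     # assumed (known) observation error on the predictor

# True predictor, response, and predictor observed with error.
x_true = [random.gauss(0.0, 1.0) for _ in range(n)]
y = [1.0 + true_slope * x + random.gauss(0.0, 0.3) for x in x_true]
x_obs = [x + random.gauss(0.0, error_sd) for x in x_true]

# Naive slope: attenuated toward zero by the error in the predictor.
naive_slope = ols_slope(x_obs, y)

# Reliability ratio: share of the observed predictor variance that is "true" variance.
var_obs = sample_variance(x_obs)
reliability = (var_obs - error_sd ** 2) / var_obs

# Method-of-moments correction: divide the naive slope by the reliability ratio.
corrected_slope = naive_slope / reliability

print("naive:", round(naive_slope, 3))
print("reliability ratio:", round(reliability, 3))
print("corrected:", round(corrected_slope, 3))
```

The naive slope comes out flatter than the slope used to generate the data, and dividing by the reliability ratio brings it back close to that value. With real data the error variance has to be estimated, so the correction carries its own uncertainty.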

Yes, the sampling error can have mixed sources (some can be “biological”). If 
you only have information about the variance for some species, maybe you can 
still approximate the value for the others by using a pooled estimate?
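
As a rough sketch of what I mean by a pooled estimate (plain Python; the species names, sample sizes and variances below are hypothetical placeholders, not values from your data): the within-species variances that are available are averaged, weighted by their degrees of freedom, and the result is borrowed for species known from one or a few specimens.

```python
def pooled_variance(groups):
    """Pool within-species variances, weighting each by its degrees of freedom.

    `groups` is an iterable of (sample_size, sample_variance) pairs.
    """
    numerator = 0.0
    degrees_of_freedom = 0
    for n, variance in groups:
        if n > 1:  # a single specimen carries no information on the variance
            numerator += (n - 1) * variance
            degrees_of_freedom += n - 1
    if degrees_of_freedom == 0:
        raise ValueError("no species with more than one specimen")
    return numerator / degrees_of_freedom

# Hypothetical species with a reported sample size and variance (log scale).
measured = {
    "species_a": (12, 0.040),
    "species_b": (5, 0.090),
    "species_c": (8, 0.060),
}

# Hypothetical sample sizes for all species, including those without a variance.
sample_sizes = {"species_a": 12, "species_b": 5, "species_c": 8,
                "species_d": 1, "species_e": 2}

pooled = pooled_variance(measured.values())

# Error variance of each species mean: own variance if reported, pooled otherwise.
mean_error_variance = {}
for species, n in sample_sizes.items():
    variance = measured[species][1] if species in measured else pooled
    mean_error_variance[species] = variance / n

print("pooled variance:", round(pooled, 4))
for species, value in mean_error_variance.items():
    print(species, round(value, 4))
```

This assumes the within-species variance is roughly comparable across species on the log scale, which is worth checking on the species where it is reported.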

Best wishes,

Julien


From: Russell Engelman <neovenatori...@gmail.com>
Sent: Friday, October 22, 2021 00:54
To: Julien Clavel <julien.cla...@hotmail.fr>; mailman, r-sig-phylo 
<r-sig-phylo@r-project.org>
Subject: Re: [R-sig-phylo] Irregularity in PGLS Slope Driven By Scope of Taxon 
Selection 
 
Dear Dr. Clavel, 
 
If you plot the residuals against your predictor they will likely be correlated 
in this case.

I'm not sure this is the case. For the OLS fit, when I make a 
residuals-versus-fits plot, the results are mostly linear and suggestive of 
normality. There is some non-random distribution of the residuals, but this is 
driven by two clades that end up biasing the fit, which is part of the reason 
I am trying to see if PGLS methods produce more reasonable results.

The scale-location plot suggests increasing variance in residuals with 
increasing size, but this also appears to be driven by the two clades that were 
biasing the fit under OLS and overall show reduced correlation between brain 
and body size. Thus the heteroskedasticity in this plot is driven by biological 
variation rather than measurement error. Excluding these two groups produces a 
scale-location plot where the log residuals are homoskedastic.

I would guess that there’s likely less or as much uncertainty in the estimate 
of brain size than for body size across mammals if both were independently 
estimated. 

This seems to be what Pagel and Harvey (1988) were suggesting, that somehow 
error variation in body size was driving shallower slopes in body size among 
mammals (within-genus regressions had shallow slopes, then within-family, then 
within-order). However, it wasn't quite clear what they meant by sampling error 
(e.g., the imprecision in the actual measurement, or the intraspecific 
variation in body mass due to body condition). I think it sounds reasonable 
that this is probably the case.

Assuming you can obtain an estimate for this error, it’s usually possible to 
correct this bias. An alternative is to include another “instrumental variable” 
as covariate.

How would one go about doing this? I ask because most of this data comes from 
prior literature sources and many times standard deviations in the variables 
are not reported. Some of the data come from single individuals due to limited 
availability of specimens in the parent study(/ies). I saw that Hansen & 
Bartoszek 2012 mention a "reliability ratio" that they used to correct the 
data, but I'm not exactly sure if this is the same thing.

Sincerely,
Russell

On Wed, Oct 20, 2021 at 10:21 AM Julien Clavel <julien.cla...@hotmail.fr> wrote:
Hi Russell,

Just a hint, but this type of bias (assuming there are no formatting issues 
with the data) often shows up when there are considerable (non-random) errors 
in the predictors (we talk about "errors-in-variables models"). If you plot the 
residuals against your predictor they will likely be correlated in this case. I 
would guess that there’s likely less or as much uncertainty in the estimate of 
brain size than for body size across mammals if both were independently 
estimated. You can see for instance Morton-Jones & Henderson 2000 
(Technometrics) for GLS in general, and Hansen & Bartoszek 2012 (Systematic 
Biology) for the (P)GLS case.

Assuming you can obtain an estimate for this error, it’s usually possible to 
correct this bias. An alternative is to include another “instrumental variable” 
as covariate.

Best wishes,

Julien


From: R-sig-phylo <r-sig-phylo-boun...@r-project.org> on behalf of Russell 
Engelman <neovenatori...@gmail.com>
Sent: Wednesday, October 20, 2021 04:29
To: mailman, r-sig-phylo <r-sig-phylo@r-project.org>
Subject: [R-sig-phylo] Irregularity in PGLS Slope Driven By Scope of Taxon 
Selection 
 
Dear R-Sig-Phylo,

I'm having a very strange issue with PGLS in R and I was wondering if anyone 
had seen this before. 

I've been doing some work with brain size in mammals, using the dataset of 
Burger et al. 2019 as a base. The results here use that dataset, but the same 
thing happens with my own data.

I have been trying to calculate a PGLS fit based on the suggestions of some 
previous authors that the best fit line is biased by delphinoids and anthropoid 
primates. However, the best fit line I get does not follow the data at all, 
whether this line is calculated for all rodents or all mammals. At first I 
thought maybe the PGLS best fit line was simply very different when 
phylogenetic covariance is minimized, but then I found out this wasn't the case 
at all. Many other studies such as Boddy et al. (2012) used PGLS and got slopes 
that looked reasonable. E.g., Boddy et al. (2012) got slopes of log-brain size 
to log-body size of 0.63-0.68, which makes sense given the distribution of the 
data, whereas the dataset I have here gives a slope of 0.51, which completely 
bypasses the linear distribution of the data.

Notably, the data I have here isn't distributed in a way that suggests the OLS 
fit is driven by 

Here's where it gets even stranger. On a suggestion from my co-author I 
performed a PGLS fit using only the median species from each family, such that 
no one clade would have a huge influence on the regression and the PGLS would 
be making comparisons between higher-level clades. The best-fit lines for the 
family-level regression had a much higher slope than for treating each species 
individually, such that the PGLS line was pretty close to the OLS.

I have no idea why this is occurring. I can't figure out why the PGLS function 
is consistently producing a line that does not follow the distribution of the 
data at all, even when the data is subsetted to more restricted taxonomic 
intervals. It is especially unclear why reducing the dataset to "one species 
per family" results in a dramatically higher slope. The closest thing I can 
think of is this issue noted by Pagel and Harvey 1988, who noted there was some 
kind of methodological issue where restricting taxonomic scope resulted in 
increasingly lower slopes due to some kind of mathematical issue that wasn't 
clear when I read the paper.

What I'm wondering is this: if the slope of the regression tends to be flatter 
at very narrow taxonomic intervals (e.g., within-genus), and more of the 
comparisons in the covariance matrix of a species-level regression are between 
closely related taxa, will that pull the PGLS model toward a lower slope?

I did also try an OU model, but the OU model also gave suspicious results. 
Specifically, it gave results that were near identical to the OLS, when there 
is good reason to believe the OLS slope is biased by the presence of 
large-brained cetaceans and primates. Previous studies found the PGLS slope to 
be much lower than OLS because of this, and the data here even finds excluding 
these taxa results in a lower slope.

Sincerely,
Russell
_______________________________________________
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/
