Hi,

Thanks for the Allman & Rhodes paper, it is very nice. For me at least it confirms my suspicions, but made me realise that claims of asymmetric transition rates are only suspicious if you are unprepared to make some (strong?) assumptions. If anyone disagrees with what I have written below, then please tell me and I will try again to understand this stuff:

Identifiability is achieved because the pdf for the root state is the stationary distribution (denoted by sigma in Allman & Rhodes: see example 1). This is, I believe, the default in newer versions of Mesquite, although in older versions the distribution is 0.5/0.5 as in ace.

If the pdf of the root state is defined by an additional parameter, this leaves a single parameter to describe the rate of transitions, and asymmetrical transition rates are non-identifiable. It seems to me there is a choice to be made between a) assuming the same processes after the root held before the root and talk about asymmetric transition rates or b) do not make this assumption and then admit that the rates of transition from 0->1 and 1->0 are not separable. I don't think the data can be used to distinguish between these view points, and so its a matter of personal choice which interpretation/model is used.

Cheers,

Jarrod








Quoting Mark Holder <mthol...@ku.edu> on Thu, 16 Aug 2012 23:41:45 -0500:

Hi,

I agree that model testing between ARD vs MK models is going to be misleading when the process is really described by a threshold model (and sorry for ignoring that set of simulations by Jarrod previously; somehow I misfiled that email and didn't see it).

The threshold model has nice ways of dealing with correlations among characters. However, when it is applied as the underlying model for a single binary character (as in Jarrod's sims), the threshold model is similar to the single-site version of the covarion model (Tuffley and Steel's version).

I don't think the models are identical, but they are quite similar. I suspect that if you generated a data set under one of the models, it would be quite hard to determine which was the generating model. Instead of just having a an "on" and "off" state (as in the covarion model), the threshold model has a continuum (the further the underlying continuous trait is from boundary, the more "off" the observable binary trait is). Allman and Rhodes (2009, ref below) proved some results on the identifiability of generalizations of covarion processes. They considered models with more hidden rate categories (not just rate of zero and an rate of evolution when in the "on" state). I believe that their results were that the number of hidden rate categories that you can identify cannot exceed the number of observable states. So it may be hard to get much richer than the Tuffley+Steel covarion when you have a binary character.

Which is a long way of saying that, it might be worth looking at the covarion model variants for the types of data that Jarrod is interested in. Implementations of the covarion model for two states is quite fast and tractable. Testing Mk+covarion vs ARD+covarion may indeed be a more robust way of
detecting asymmetry in rates of character transitions compared to Mk vs ARD.


Thanks for pointing out the Boettiger et al paper, Matt.


all the best,
Mark


[1] E. S. Allman and J. A. Rhodes, “The Identifiability of Covarion Models in Phylogenetics,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 6, no. 1, pp. 76–88, Jan. 2009.




On Aug 16, 2012, at 10:07 PM, Matt Pennell wrote:

correction: the last sentence should have read

I wonder how that would work in this case. I think these are important questions going forward.

On Thu, Aug 16, 2012 at 11:00 PM, Matt Pennell <mwpenn...@gmail.com> wrote:
Hey all,

This has been a really fantastic discussion. Mark, you make some really excellent points in response to my earlier comments. I think you are correct in this.

The question that arises out of Jarrod and Dan's simulations (which I have just run) is whether a model selection criteria would be able to distinguish MK from the threshold model that Felsenstein (and Wright before him) put forth? And how do we best assess model adequacy? Carl Boettiger and company (2012: Evolution) suggested a Phylogenetic Monte Carlo approach for continuous characters. I wonder how that would before I think these are important questions going forward.

thanks again,
matt



On Thu, Aug 16, 2012 at 10:43 PM, Dan Rabosky <drabo...@umich.edu> wrote:

Hi all-

A couple of points. I am actually less concerned about the Type I error rates I gave in that previous message for the equal rates markov process, even though I think they are real (e.g., I can corroborate them using Diversitree). I don't think it is an issue of ascertainment bias, but I think Mark may be right about the LRT being inappropriate with few events on the tree and this may well explain the matter. This is probably worth exploring further.

However, I am much more concerned about Jarrod's second model (with underlying continuous latent variable). This seems to be a serious problem, and if you simulate under the latent model, I think he is right that Type I error rates are really, really high. The model is reasonable: there is a continuous trait that influences the probability of observing a particular tip state. In a practical sense, this probably means the following: (i) some clades in the tree will essentially be fixed for the character in question. (ii) Other clades will appear to have high lability of the character. The clades that are fixed will be those clades where the underlying threshold character (e.g., mean clade value) drifts towards -Inf (or +Inf). Regardless of whether we think about this latent model, this at least leads to an interesting - and probably quite relevant - form of model misspecification. The model essentially induces some extra heterogeneity in rates, such that some clades will appear to be switching quickly and others slowly. However, it is still a symmetric model of sorts.

You can simulate data easily under this model and verify (using whatever software) that it is a problem. I'm attaching code that does this. You can play around with 3 parameters: (i) the number of taxa in the analysis (set to 100); (ii) the expected variance of the continuous latency factor (from roots to tip); and (iii) the root state. These parameters are NTAXA, tipvar, and root in the code below.

I'm keen to see what others think, but it looks to me like you can simulate very reasonable-looking datasets and obtain extremely strong support for an asymmetric model - even though the model is quasi-symmetric. So, if these hold, then I think this is a serious issue - nothing we routinely do in the analysis of discrete characters is designed to detect this sort of model misspecification.

## A single simulation

library(diversitree);
library(geiger);
library(mvtnorm);

NTAXA <- 101;

# Generate the tree:
x <- birthdeath.tree(b=1, d=0, taxa.stop=NTAXA);
x <- drop.tip(x, x$tip.label[x$edge[,2][x$edge.length==0][1]]);


vv <- vcv.phylo(x); # get phylogenetic vcv matrix

# Now we set the expected variance at the tips:
#       e.g., the value we want for the diagonal of the vcv matrix
# If this is = 1, you'll have a "phylogenetic" standard normal distribution
#

tipvar <- 2;
sf <- tipvar/max(vv); #get scale factor for vcv matrix

vmat <- vv*sf; # scale matrix
root <- 0; #root state: this assumes the root is equally likely to give either state

mu <- rep(root, nrow(vmat));  #vector of means


# Simulate continuous, and then discrete, chars from
#               the corresponding mvn and binomial distributions
chars <- rmvnorm(1, mean=mu, sigma=vmat);
states <- rbinom(length(chars), 1, prob=plogis(chars));
names(states) <- x$tip.label;

# Look at the data...
plot(x, show.tip.label=F);
tiplabels(pch=21, bg = c('black', 'white')[(states+1)], col='black', cex=1);

#### Using Diverstree for model fitting:
lfx <- make.mk2(x, states); # The asymmetric likelihood function
lfxcon <- constrain(lfx, formulae = list(q01 ~ q10)); #constraining q01 ~ q10

# Estimation...
l2 <- find.mle(lfx, runif(2, 0, 5))$lnLik;
l1 <- find.mle(lfxcon, runif(1, 0, 5))$lnLik;

#likelihood ratio test
lrt <- -2*l1 + 2*l2;
1 - pchisq(lrt, df=1);


### End sim

Cheers,
~Dan




On Aug 16, 2012, at 9:22 PM, Mark Holder wrote:

> Hi all,
>       <apologies for the long email>
>
> I'm a bit more concerned with Dan's elevated Type-1 error rates than Jarrod's example.
>
> With respect to Jarrod's simulations, I have a few thoughts:
> 1. I don't understand the claim (in the original email) that "its fairly straightforward to prove that asymmetric transition rates cannot be identified using data collected on the tips of a phylogeny" It seems like this is something that is routinely done in phylogenetics, and proofs of identifiability of GTR exist (demonstrating that this indeed feasible and not some computational artifact).
>
> 2. I think that Dan's original response is correct. We should not expect to reject the null only 5% of the time if we simulate a bias in the states of the tips. Simulating data without respect to a phylogeny and then analyzing under Mk or "ARD" should be like a simulation in which the rate of character evolution is essentially infinite. The Mk model predicts the states to be equally frequent, and so it is not problematic for a deviation from 50:50 to favor the ARD model over the Mk model. In fact, the simulation model is equivalent the ARD model with a rate of evolution approaching infinity, so we should prefer the ARD model.
>
> 3. (in reference to Matt's post) In the ARD model, it is possible for the MLE of the equilibrium state frequency of the less frequent state to be >0.5. Presumably, this is a rare occurrence, but I don't agree with the characterization of the ratio of parameters converging to the ratio of states.
>
> Consider a clade with lots of tips in state 1 but tiny branch lengths. If this clade is found in the context of a tree with a few other branches that are long and lead to tips with state 0, then you can get an MLE of the state frequency for state 0 being > 0.5. Most of the tips will have state 1, but because they are easily explained by one transition to 1 you can still infer that the less frequent state has a higher equilibrium frequency.
>
> Perhaps, I'm mis-reading what Matt is referring to when he discusses an analysis of a tree with "a lot of tips (ie. approaching the limit)." I do agree that if you simulate a very large tree under ARD (with the frequencies not equal to 0.5), then the frequency of the states at the tips will converge to the equilibrium state frequencies.
>
>
>
> With respect to Dan's results:
>
> The Type I error rate of 0.12 troubles me. Have you tried exporting the data and seeing if other software agrees with the likelihood ratios returned by ace() ? I looked at the code for ace and nothing looked amiss to me (though my R skills are virtually non-existent).
>
> If the result is corroborated by other software, then my best guesses would be: > 1. ascertainment bias (the simulation model clearly excludes constant patterns, but I don't believe that the inference model in ace does any correction for this), and/or > 2. the assumption that you can use the chi-square as the null distribution for the LRT probably breaks down when you have very few events on the tree. In some sense the number of events is our measure of the amount of data, and when we have very few events on the tree the asymptotic behavior of the LRT under the null is probably not going to help us.
>
> In the limiting case, when rates of character change are so low that you never see homoplasy, then I think the LRT of the the two models should get close to 1 on virtually any realization (conditional on starting in state 1 and having exactly 1 change on the tree, both model make the same predictions about the data; so in this realm the data should not prefer one model over the other). So, I'm not sure how the small data explanation would explain your observation of an excess of large LRT statistics.
>
>
> those are my 2 cents.
>
> all the best,
> Mark
>
>

_____________________
Dan Rabosky
Assistant Professor
Dept of Ecology and Evolutionary Biology
& Museum of Zoology
University of Michigan
drabo...@umich.edu




Mark Holder

mthol...@ku.edu
http://phylo.bio.ku.edu/mark-holder

==============================================
Department of Ecology and Evolutionary Biology
University of Kansas
6031 Haworth Hall
1200 Sunnyside Avenue
Lawrence, Kansas 66045

lab phone:  785.864.5789

fax (shared): 785.864.5860
==============================================












--
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

_______________________________________________
R-sig-phylo mailing list
R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo

Reply via email to