correction: the last sentence should have read I wonder how that would work in this case. I think these are important questions going forward.
On Thu, Aug 16, 2012 at 11:00 PM, Matt Pennell <mwpenn...@gmail.com> wrote: > Hey all, > > This has been a really fantastic discussion. Mark, you make some really > excellent points in response to my earlier comments. I think you are > correct in this. > > The question that arises out of Jarrod and Dan's simulations (which I have > just run) is whether a model selection criteria would be able to > distinguish MK from the threshold model that Felsenstein (and Wright before > him) put forth? And how do we best assess model adequacy? Carl Boettiger > and company (2012: Evolution) suggested a Phylogenetic Monte Carlo approach > for continuous characters. I wonder how that would before I think these > are important questions going forward. > > thanks again, > matt > > > > On Thu, Aug 16, 2012 at 10:43 PM, Dan Rabosky <drabo...@umich.edu> wrote: > >> >> Hi all- >> >> A couple of points. I am actually less concerned about the Type I error >> rates I gave in that previous message for the equal rates markov process, >> even though I think they are real (e.g., I can corroborate them using >> Diversitree). I don't think it is an issue of ascertainment bias, but I >> think Mark may be right about the LRT being inappropriate with few events >> on the tree and this may well explain the matter. This is probably worth >> exploring further. >> >> However, I am much more concerned about Jarrod's second model (with >> underlying continuous latent variable). This seems to be a serious problem, >> and if you simulate under the latent model, I think he is right that Type I >> error rates are really, really high. The model is reasonable: there is a >> continuous trait that influences the probability of observing a particular >> tip state. In a practical sense, this probably means the following: (i) >> some clades in the tree will essentially be fixed for the character in >> question. (ii) Other clades will appear to have high lability of the >> character. The clades that are fixed will be those clades where the >> underlying threshold character (e.g., mean clade value) drifts towards -Inf >> (or +Inf). Regardless of whether we think about this latent model, this at >> least leads to an interesting - and probably quite relevant - form of model >> misspecification. The model essentially induces some extra heterogeneity in >> rates, such that some clades will appear to be switching quickly and others >> slowly. However, it is still a symmetric model of sorts. >> >> You can simulate data easily under this model and verify (using whatever >> software) that it is a problem. I'm attaching code that does this. You can >> play around with 3 parameters: (i) the number of taxa in the analysis (set >> to 100); (ii) the expected variance of the continuous latency factor (from >> roots to tip); and (iii) the root state. These parameters are NTAXA, >> tipvar, and root in the code below. >> >> I'm keen to see what others think, but it looks to me like you can >> simulate very reasonable-looking datasets and obtain extremely strong >> support for an asymmetric model - even though the model is quasi-symmetric. >> So, if these hold, then I think this is a serious issue - nothing we >> routinely do in the analysis of discrete characters is designed to detect >> this sort of model misspecification. >> >> ## A single simulation >> >> library(diversitree); >> library(geiger); >> library(mvtnorm); >> >> NTAXA <- 101; >> >> # Generate the tree: >> x <- birthdeath.tree(b=1, d=0, taxa.stop=NTAXA); >> x <- drop.tip(x, x$tip.label[x$edge[,2][x$edge.length==0][1]]); >> >> >> vv <- vcv.phylo(x); # get phylogenetic vcv matrix >> >> # Now we set the expected variance at the tips: >> # e.g., the value we want for the diagonal of the vcv matrix >> # If this is = 1, you'll have a "phylogenetic" standard normal >> distribution >> # >> >> tipvar <- 2; >> sf <- tipvar/max(vv); #get scale factor for vcv matrix >> >> vmat <- vv*sf; # scale matrix >> root <- 0; #root state: this assumes the root is equally likely to give >> either state >> >> mu <- rep(root, nrow(vmat)); #vector of means >> >> >> # Simulate continuous, and then discrete, chars from >> # the corresponding mvn and binomial distributions >> chars <- rmvnorm(1, mean=mu, sigma=vmat); >> states <- rbinom(length(chars), 1, prob=plogis(chars)); >> names(states) <- x$tip.label; >> >> # Look at the data... >> plot(x, show.tip.label=F); >> tiplabels(pch=21, bg = c('black', 'white')[(states+1)], col='black', >> cex=1); >> >> #### Using Diverstree for model fitting: >> lfx <- make.mk2(x, states); # The asymmetric likelihood function >> lfxcon <- constrain(lfx, formulae = list(q01 ~ q10)); #constraining q01 ~ >> q10 >> >> # Estimation... >> l2 <- find.mle(lfx, runif(2, 0, 5))$lnLik; >> l1 <- find.mle(lfxcon, runif(1, 0, 5))$lnLik; >> >> #likelihood ratio test >> lrt <- -2*l1 + 2*l2; >> 1 - pchisq(lrt, df=1); >> >> >> ### End sim >> >> Cheers, >> ~Dan >> >> >> >> >> On Aug 16, 2012, at 9:22 PM, Mark Holder wrote: >> >> > Hi all, >> > <apologies for the long email> >> > >> > I'm a bit more concerned with Dan's elevated Type-1 error rates than >> Jarrod's example. >> > >> > With respect to Jarrod's simulations, I have a few thoughts: >> > 1. I don't understand the claim (in the original email) that "its >> fairly straightforward to prove that asymmetric transition rates cannot be >> identified using data collected on the tips of a phylogeny" It seems like >> this is something that is routinely done in phylogenetics, and proofs of >> identifiability of GTR exist (demonstrating that this indeed feasible and >> not some computational artifact). >> > >> > 2. I think that Dan's original response is correct. We should not >> expect to reject the null only 5% of the time if we simulate a bias in the >> states of the tips. Simulating data without respect to a phylogeny and then >> analyzing under Mk or "ARD" should be like a simulation in which the rate >> of character evolution is essentially infinite. The Mk model predicts the >> states to be equally frequent, and so it is not problematic for a deviation >> from 50:50 to favor the ARD model over the Mk model. In fact, the >> simulation model is equivalent the ARD model with a rate of evolution >> approaching infinity, so we should prefer the ARD model. >> > >> > 3. (in reference to Matt's post) In the ARD model, it is possible >> for the MLE of the equilibrium state frequency of the less frequent state >> to be >0.5. Presumably, this is a rare occurrence, but I don't agree with >> the characterization of the ratio of parameters converging to the ratio of >> states. >> > >> > Consider a clade with lots of tips in state 1 but tiny branch >> lengths. If this clade is found in the context of a tree with a few other >> branches that are long and lead to tips with state 0, then you can get an >> MLE of the state frequency for state 0 being > 0.5. Most of the tips will >> have state 1, but because they are easily explained by one transition to 1 >> you can still infer that the less frequent state has a higher equilibrium >> frequency. >> > >> > Perhaps, I'm mis-reading what Matt is referring to when he >> discusses an analysis of a tree with "a lot of tips (ie. approaching the >> limit)." I do agree that if you simulate a very large tree under ARD (with >> the frequencies not equal to 0.5), then the frequency of the states at the >> tips will converge to the equilibrium state frequencies. >> > >> > >> > >> > With respect to Dan's results: >> > >> > The Type I error rate of 0.12 troubles me. Have you tried exporting >> the data and seeing if other software agrees with the likelihood ratios >> returned by ace() ? I looked at the code for ace and nothing looked amiss >> to me (though my R skills are virtually non-existent). >> > >> > If the result is corroborated by other software, then my best guesses >> would be: >> > 1. ascertainment bias (the simulation model clearly excludes >> constant patterns, but I don't believe that the inference model in ace does >> any correction for this), and/or >> > 2. the assumption that you can use the chi-square as the null >> distribution for the LRT probably breaks down when you have very few events >> on the tree. In some sense the number of events is our measure of the >> amount of data, and when we have very few events on the tree the asymptotic >> behavior of the LRT under the null is probably not going to help us. >> > >> > In the limiting case, when rates of character change are so low that >> you never see homoplasy, then I think the LRT of the the two models should >> get close to 1 on virtually any realization (conditional on starting in >> state 1 and having exactly 1 change on the tree, both model make the same >> predictions about the data; so in this realm the data should not prefer one >> model over the other). So, I'm not sure how the small data explanation >> would explain your observation of an excess of large LRT statistics. >> > >> > >> > those are my 2 cents. >> > >> > all the best, >> > Mark >> > >> > >> >> _____________________ >> Dan Rabosky >> Assistant Professor >> Dept of Ecology and Evolutionary Biology >> & Museum of Zoology >> University of Michigan >> drabo...@umich.edu >> >> > [[alternative HTML version deleted]] _______________________________________________ R-sig-phylo mailing list R-sig-phylo@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-phylo