correction: the last sentence should have read

I wonder how that would work in this case. I think these are important
questions going forward.

On Thu, Aug 16, 2012 at 11:00 PM, Matt Pennell <mwpenn...@gmail.com> wrote:

> Hey all,
>
> This has been a really fantastic discussion. Mark, you make some really
> excellent points in response to my earlier comments. I think you are
> correct in this.
>
> The question that arises out of Jarrod and Dan's simulations (which I have
> just run) is whether a model selection criteria would be able to
> distinguish MK from the threshold model that Felsenstein (and Wright before
> him) put forth? And how do we best assess model adequacy? Carl Boettiger
> and company (2012: Evolution) suggested a Phylogenetic Monte Carlo approach
> for continuous characters. I wonder how that would before  I think these
> are important questions going forward.
>
> thanks again,
> matt
>
>
>
> On Thu, Aug 16, 2012 at 10:43 PM, Dan Rabosky <drabo...@umich.edu> wrote:
>
>>
>> Hi all-
>>
>> A couple of points. I am actually less concerned about the Type I error
>> rates I gave in that previous message for the equal rates markov process,
>> even though I think they are real (e.g., I can corroborate them using
>> Diversitree). I don't think it is an issue of ascertainment bias, but I
>> think Mark may be right about the LRT being inappropriate with few events
>> on the tree and this may well explain the matter. This is probably worth
>> exploring further.
>>
>> However, I am much more concerned about Jarrod's second model (with
>> underlying continuous latent variable). This seems to be a serious problem,
>> and if you simulate under the latent model, I think he is right that Type I
>> error rates are really, really high. The model is reasonable: there is a
>> continuous trait that influences the probability of observing a particular
>> tip state. In a practical sense, this probably means the following: (i)
>> some clades in the tree will essentially be fixed for the character in
>> question. (ii) Other clades will appear to have high lability of the
>> character. The clades that are fixed will be those clades where the
>> underlying threshold character (e.g., mean clade value) drifts towards -Inf
>> (or +Inf). Regardless of whether we think about this latent model, this at
>> least leads to an interesting - and probably quite relevant - form of model
>> misspecification. The model essentially induces some extra heterogeneity in
>> rates, such that some clades will appear to be switching quickly and others
>> slowly. However, it is still a symmetric model of sorts.
>>
>> You can simulate data easily under this model and verify (using whatever
>> software) that it is a problem. I'm attaching code that does this. You can
>> play around with 3 parameters: (i) the number of taxa in the analysis (set
>> to 100); (ii) the expected variance of the continuous latency factor (from
>> roots to tip); and (iii) the root state. These parameters are NTAXA,
>> tipvar, and root in the code below.
>>
>> I'm keen to see what others think, but it looks to me like you can
>> simulate very reasonable-looking datasets and obtain extremely strong
>> support for an asymmetric model - even though the model is quasi-symmetric.
>> So, if these hold, then I think this is a serious issue - nothing we
>> routinely do in the analysis of discrete characters is designed to detect
>> this sort of model misspecification.
>>
>> ## A single simulation
>>
>> library(diversitree);
>> library(geiger);
>> library(mvtnorm);
>>
>> NTAXA <- 101;
>>
>> # Generate the tree:
>> x <- birthdeath.tree(b=1, d=0, taxa.stop=NTAXA);
>> x <- drop.tip(x, x$tip.label[x$edge[,2][x$edge.length==0][1]]);
>>
>>
>> vv <- vcv.phylo(x); # get phylogenetic vcv matrix
>>
>> # Now we set the expected variance at the tips:
>> #       e.g., the value we want for the diagonal of the vcv matrix
>> #       If this is = 1, you'll have a "phylogenetic" standard normal
>> distribution
>> #
>>
>> tipvar <- 2;
>> sf <- tipvar/max(vv); #get scale factor for vcv matrix
>>
>> vmat <- vv*sf; # scale matrix
>> root <- 0; #root state: this assumes the root is equally likely to give
>> either state
>>
>> mu <- rep(root, nrow(vmat));  #vector of means
>>
>>
>> # Simulate continuous, and then discrete, chars from
>> #               the corresponding mvn and binomial distributions
>> chars <- rmvnorm(1, mean=mu, sigma=vmat);
>> states <- rbinom(length(chars), 1, prob=plogis(chars));
>> names(states) <- x$tip.label;
>>
>> # Look at the data...
>> plot(x, show.tip.label=F);
>> tiplabels(pch=21, bg = c('black', 'white')[(states+1)], col='black',
>> cex=1);
>>
>> #### Using Diverstree for model fitting:
>> lfx <- make.mk2(x, states); # The asymmetric likelihood function
>> lfxcon <- constrain(lfx, formulae = list(q01 ~ q10)); #constraining q01 ~
>> q10
>>
>> # Estimation...
>> l2 <- find.mle(lfx, runif(2, 0, 5))$lnLik;
>> l1 <- find.mle(lfxcon, runif(1, 0, 5))$lnLik;
>>
>> #likelihood ratio test
>> lrt <- -2*l1 + 2*l2;
>> 1 - pchisq(lrt, df=1);
>>
>>
>> ### End sim
>>
>> Cheers,
>> ~Dan
>>
>>
>>
>>
>> On Aug 16, 2012, at 9:22 PM, Mark Holder wrote:
>>
>> > Hi all,
>> >       <apologies for the long email>
>> >
>> > I'm a bit more concerned with Dan's elevated Type-1 error rates than
>> Jarrod's example.
>> >
>> > With respect to Jarrod's simulations, I have a few thoughts:
>> >       1. I don't understand the claim (in the original email) that "its
>> fairly straightforward to prove that asymmetric transition rates cannot be
>> identified using data collected on the tips of a phylogeny" It seems like
>> this is something that is routinely done in phylogenetics, and proofs of
>> identifiability of GTR exist (demonstrating that  this indeed feasible and
>> not some computational artifact).
>> >
>> >       2. I think that Dan's original response is correct. We should not
>> expect to reject the null only 5% of the time if we simulate a bias in the
>> states of the tips. Simulating data without respect to a phylogeny and then
>> analyzing under Mk or "ARD" should be like a simulation in which the rate
>> of character evolution is essentially infinite.  The Mk model predicts the
>> states to be equally frequent, and so it is not problematic for a deviation
>> from 50:50 to favor the ARD model over the Mk model. In fact, the
>> simulation model is equivalent the ARD model with a rate of evolution
>> approaching infinity, so we should prefer the ARD model.
>> >
>> >       3. (in reference to Matt's post) In the ARD model, it is possible
>> for the MLE of the equilibrium state frequency of the less frequent state
>> to be >0.5. Presumably, this is a rare occurrence, but I don't agree with
>> the characterization of the ratio of parameters converging to the ratio of
>> states.
>> >
>> >       Consider a clade with lots of tips in state 1 but tiny branch
>> lengths. If this clade is found in the context of a tree with a few other
>> branches that are long and lead to tips with state 0, then you can get an
>> MLE of the state frequency for state 0 being > 0.5. Most of the tips will
>> have state 1, but because they are easily explained by one transition to 1
>> you can still infer that the less frequent state has a higher equilibrium
>> frequency.
>> >
>> >       Perhaps, I'm mis-reading what Matt is referring to when he
>> discusses an analysis of a tree with "a lot of tips (ie. approaching the
>> limit)." I do agree that if you simulate a very large tree under ARD (with
>> the frequencies not equal to 0.5), then the frequency of the states at the
>> tips will converge to the equilibrium state frequencies.
>> >
>> >
>> >
>> > With respect to Dan's results:
>> >
>> > The Type I error rate of 0.12 troubles me.  Have you tried exporting
>> the data and seeing if other software agrees with the likelihood ratios
>> returned by ace() ?  I looked at the code for ace and nothing looked amiss
>> to me (though my R skills are virtually non-existent).
>> >
>> > If the result is corroborated by other software, then my best guesses
>> would be:
>> >       1. ascertainment bias (the simulation model clearly excludes
>> constant patterns, but I don't believe that the inference model in ace does
>> any correction for this), and/or
>> >       2. the assumption that you can use the chi-square as the null
>> distribution for the LRT probably breaks down when you have very few events
>> on the tree. In some sense the number of events is our measure of the
>> amount of data, and when we have very few events on the tree the asymptotic
>> behavior of the LRT under the null is probably not going to help us.
>> >
>> > In the limiting case, when rates of character change are so low that
>> you never see homoplasy, then I think the LRT of the the two models should
>> get close to 1 on virtually any realization (conditional on starting in
>> state 1 and having exactly 1 change on the tree, both model make the same
>> predictions about the data; so in this realm the data should not prefer one
>> model over the other).  So, I'm not sure how the small data explanation
>> would explain your observation of an excess of large LRT statistics.
>> >
>> >
>> > those are my 2 cents.
>> >
>> > all the best,
>> > Mark
>> >
>> >
>>
>> _____________________
>> Dan Rabosky
>> Assistant Professor
>> Dept of Ecology and Evolutionary Biology
>> & Museum of Zoology
>> University of Michigan
>> drabo...@umich.edu
>>
>>
>

        [[alternative HTML version deleted]]

_______________________________________________
R-sig-phylo mailing list
R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo

Reply via email to