Hi all,
        <apologies for the long email>

I'm a bit more concerned with Dan's elevated Type-1 error rates than Jarrod's 
example.

With respect to Jarrod's simulations, I have a few thoughts:
        1. I don't understand the claim (in the original email) that "its 
fairly straightforward to prove that asymmetric transition rates cannot be 
identified using data collected on the tips of a phylogeny" It seems like this 
is something that is routinely done in phylogenetics, and proofs of 
identifiability of GTR exist (demonstrating that  this indeed feasible and not 
some computational artifact).

        2. I think that Dan's original response is correct. We should not 
expect to reject the null only 5% of the time if we simulate a bias in the 
states of the tips. Simulating data without respect to a phylogeny and then 
analyzing under Mk or "ARD" should be like a simulation in which the rate of 
character evolution is essentially infinite.  The Mk model predicts the states 
to be equally frequent, and so it is not problematic for a deviation from 50:50 
to favor the ARD model over the Mk model. In fact, the simulation model is 
equivalent the ARD model with a rate of evolution approaching infinity, so we 
should prefer the ARD model.

        3. (in reference to Matt's post) In the ARD model, it is possible for 
the MLE of the equilibrium state frequency of the less frequent state to be 
>0.5. Presumably, this is a rare occurrence, but I don't agree with the 
characterization of the ratio of parameters converging to the ratio of states. 

        Consider a clade with lots of tips in state 1 but tiny branch lengths. 
If this clade is found in the context of a tree with a few other branches that 
are long and lead to tips with state 0, then you can get an MLE of the state 
frequency for state 0 being > 0.5. Most of the tips will have state 1, but 
because they are easily explained by one transition to 1 you can still infer 
that the less frequent state has a higher equilibrium frequency.

        Perhaps, I'm mis-reading what Matt is referring to when he discusses an 
analysis of a tree with "a lot of tips (ie. approaching the limit)." I do agree 
that if you simulate a very large tree under ARD (with the frequencies not 
equal to 0.5), then the frequency of the states at the tips will converge to 
the equilibrium state frequencies. 
        


With respect to Dan's results:

The Type I error rate of 0.12 troubles me.  Have you tried exporting the data 
and seeing if other software agrees with the likelihood ratios returned by 
ace() ?  I looked at the code for ace and nothing looked amiss to me (though my 
R skills are virtually non-existent).

If the result is corroborated by other software, then my best guesses would be:
        1. ascertainment bias (the simulation model clearly excludes constant 
patterns, but I don't believe that the inference model in ace does any 
correction for this), and/or
        2. the assumption that you can use the chi-square as the null 
distribution for the LRT probably breaks down when you have very few events on 
the tree. In some sense the number of events is our measure of the amount of 
data, and when we have very few events on the tree the asymptotic behavior of 
the LRT under the null is probably not going to help us. 

In the limiting case, when rates of character change are so low that you never 
see homoplasy, then I think the LRT of the the two models should get close to 1 
on virtually any realization (conditional on starting in state 1 and having 
exactly 1 change on the tree, both model make the same predictions about the 
data; so in this realm the data should not prefer one model over the other).  
So, I'm not sure how the small data explanation would explain your observation 
of an excess of large LRT statistics.


those are my 2 cents.

all the best,
Mark



        
On Aug 16, 2012, at 4:24 PM, Dan Rabosky wrote:

> 
> Hi All-
> 
> This is an interesting discussion. I think there is clearly something going 
> on. I do not get catastrophic Type I error rates from this exercise (only 
> 'elevated') with discrete char simulations (an equal rates markov process) - 
> see below. However, Jarrod's latent model seems reasonable and probably a 
> decent approximation of many real phenomena, so it is extremely worrying that 
> Type I error rates would be so high. 
> 
> Also, I don't see why the BiSSE framework would be immune to this problem. If 
> the "true" process of character change (regardless of its relationship to 
> speciation and extinction) occurs with a latent/liability model, it seems 
> that BiSSE would also be affected.  
> 
> This is something to think (and worry) about.
> 
> #Simulations using pure birth tree with 100 tips; birthrate = 1.0. 
> # pure birth used as these branch lengths are more consistent with those in a 
> typical interspecific dataset
> 
> #Rate = 0.01 
> Mean frequency of root state in pure birth tree of 100 tips: 0.930
> Type I error rate: 0.12
> 
> # Rate = 0.05.
> Frequency of root state: 0.86
> Type I error rate: 0.08
> 
> # Rate = 0.1.
> Frequency of root state: 0.77
> Type I error rate: 0.08
> 
> #Rate = 1.0
> Frequency of root state: 0.501
> Type I error rate: 0.03
> 
> #Rate = 10
> Frequency of root state: 0.49
> Type I error rate: 0.04
> 
> 
> 
> #Simulation code
> 
> library(geiger);
> 
> # quick fix to deal with tree simulations in Geiger, eg birthdeath.tree
> #     stops at exactly N taxa, but this gives a pair of 0 length branches:
> 
> x <- birthdeath.tree(b=1, d=0, taxa.stop=101);
> x <- drop.tip(x, x$tip.label[x$edge[,2][x$edge.length==0][1]]);
> 
> # We drop one of the tips with 0 length branches...
> 
> 
> ## Set up rate matrix:
> rate1 <-1;
> qmat <- list(rbind(c(-rate1, rate1), c(rate1, -rate1))); #FAST rate matrix
> 
> ## PLot a tree with character states, just to get a feel for phylogenetic 
> ##    signal in the character state data:
> 
> sim <-sim.char(x, qmat, model="discrete", n=1, root.state=1);
> cvec <- c('black', 'white');
> plot.phylo(x, show.tip.label=F);
> tiplabels(pch=21, col='black', bg=cvec[sim], cex=0.9); 
> 
> NSIMS <- 100;
> fdif <- numeric(NSIMS); # vector for character frequencies
> pvec <- numeric(NSIMS); # vector for pvalues
> 
> for (i in 1:NSIMS){
>       cat(i, '\n');
>       sim <-sim.char(x, qmat, model="discrete", n=1, root.state=1);
>       
>       # make sure at least both states are observed in dataset:
>       while (sum(sim==1) == length(sim)){
>               sim <-sim.char(x, qmat, model="discrete", n=1, root.state=1);
>       }
>       
>       tx <-table(sim);
>       fdif[i] <- tx['1'] / sum(tx);
>       
>       m1 <- ace(sim, x, type = 'd', model ='SYM');
>       m2 <- ace(sim, x, type = 'd', model = 'ARD');
>       lrt <- -2*m1$loglik + 2*m2$loglik;
>       pvec[i] <- 1 - pchisq(lrt, 1)
>       
> }
> 
> mean(fdif);
> sum(pvec <= 0.05);
> 
> 
> 
> On Aug 16, 2012, at 11:04 AM, Matt Pennell wrote:
> 
>> Jarrod and Dan,
>> 
>> While I see what Dan is saying and I agree that evaluating this with 
>> non-phylogenetic data is not entirely useful, I think you have stumbled upon 
>> a known issue but one that is not widely appreciated.
>> 
>> While the MK model is a useful model for discrete characters in many ways, 
>> it may be inappropriate for such a test. If you assume that the root is 
>> equally likely to be in either state with a lot of tips (i.e. approaching 
>> the limit), the ML estimate for the ratio of q01 (transition rate from 0 to 
>> 1) to q10 will converge on the ratio between number of tips in state 1 to 
>> number of tips in state 0. So if you have a tree where most of the tips are 
>> in state 1 then you will get support for the asymmetric change model as this 
>> best explains the data. It is a very weird problem and perhaps I am 
>> incorrect in this. 
>> 
>> Regardless, this result does not hold if the discrete character is modeled 
>> simulataneously with the branching process of the phylogeny (i.e. BiSSE; 
>> Maddison et al. 2007).
>> 
>> So, in summation, I mostly agree with you. But this is not shocking behavior 
>> of the model. If you are interested, a Bayesian implementation of the Mk 
>> model can be found in the package diversitree in the function make.mk.
>> 
>> again, i could be off base here. curious to see what others think?
>> 
>> matt
>> 
>> On Thu, Aug 16, 2012 at 10:58 AM, Dan Rabosky <drabo...@umich.edu> wrote:
>> 
>> HI Jarrod-
>> 
>> It isn't immediately obvious to me why the exercise below reflects something 
>> problematic. In the first scenario, you have a random binary state but with 
>> strong differences in frequency. Because there is effectively no 
>> phylogenetic signal (as data are simply random), this suggests a high rate 
>> of transition between states. I think that such asymmetry in frequencies 
>> would be highly unlikely under a symmetric model of character change 
>> regardless of the root state. I think it is useful to think about whether a 
>> symmetric process could have given rise to a truly random distribution of 
>> tip data with strong frequency differences - my intuition is that this is 
>> highly unlikely. However, I would be happy to know what others think.
>> 
>> Cheers,
>> ~Dan
>> 
>> 
>> On Aug 16, 2012, at 10:09 AM, Jarrod Hadfield wrote:
>> 
>>> Hi,
>>> 
>>> I have had a few replies off-list which have made me try and clarify what I 
>>> mean.  I think the distinction needs to be made between two types of 
>>> probability: the probability  that an outcome is 0 or 1 Pr(y| \theta) and 
>>> the probability density of \theta, Pr(\theta | \gamma), indexed by 
>>> parameter(s) \gamma.  It seems to me that in order to make the problem 
>>> identifiable an informative prior (or an assumption) is required for the 
>>> root state.  It seems to me that the strong prior Pr(\theta=0.5|\gamma) =1 
>>> is used, and then justified because int Pr(y=0| \theta)Pr(theta)dtheta=int 
>>> Pr(y=1| \theta)Pr(theta)dtheta=0.5. A non-informative prior distribution 
>>> for \theta could be U(0,1). This also induces a prior on y of the same 
>>> form, int Pr(y=0| \theta)Pr(theta)dtheta=int Pr(y=1| 
>>> \theta)Pr(theta)dtheta=0.5, but that is not the main motivation for 
>>> choosing U(0,1).
>>> 
>>> For example, is this not worrying:
>>> 
>>> library(ape)
>>> n<-100
>>> tree<-rcoal(n)             # random tree
>>> y<-rbinom(n, 1, 0.2)  # random data unconnected to the tree
>>> m1<-ace(y, tree, type = "d", model="SYM")
>>> m2<-ace(y, tree, type = "d", model="ARD")
>>> anova(m1, m2)     # asymmetric evolutionary transition rates strongly 
>>> supported
>>> 
>>> y<-rbinom(n, 1, 0.5)  # random data unconnected to the tree but p=0.5
>>> m1<-ace(y, tree, type = "d", model="SYM")
>>> m2<-ace(y, tree, type = "d", model="ARD")
>>> anova(m1, m2)     # asymmetric evolutionary transition not supported
>>> 
>>> Cheers,
>>> 
>>> Jarrod
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> Quoting Jarrod Hadfield <j.hadfi...@ed.ac.uk> on Thu, 16 Aug 2012 12:30:30 
>>> +0100:
>>> 
>>>> Hi,
>>>> 
>>>> I have been helping someone with some analyses and came across some 
>>>> routines to estimate asymmetric transition rates between discrete 
>>>> characters. This surprised me because its fairly straightforward to prove 
>>>> that asymmetric transition rates cannot be identified using data collected 
>>>> on the tips of a phylogeny. However when I run these routines (e.g. ace) 
>>>> likelihood ratio tests often suggest strong support for asymmetric rates. 
>>>> This seems to arise because there is an implicit (and unjustified) prior 
>>>> point mass on the ancestral state *probability*. If anyone could point me 
>>>> to reference that states this that would be great? I really don't want to 
>>>> be saying we have support for processes that we need a fossil record, not 
>>>> just a phylogeny, to understand.
>>>> 
>>>> Cheers,
>>>> 
>>>> Jarrod
>>>> 
>>>> 
>>>> 
>>>> --
>>>> The University of Edinburgh is a charitable body, registered in
>>>> Scotland, with registration number SC005336.
>>>> 
>>>> _______________________________________________
>>>> R-sig-phylo mailing list
>>>> R-sig-phylo@r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> The University of Edinburgh is a charitable body, registered in
>>> Scotland, with registration number SC005336.
>>> 
>>> _______________________________________________
>>> R-sig-phylo mailing list
>>> R-sig-phylo@r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
>>> 
>>> 
>> 
>> 
>> 
>> 
>>        [[alternative HTML version deleted]]
>> 
>> _______________________________________________
>> R-sig-phylo mailing list
>> R-sig-phylo@r-project.org
>> https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
>> 
> 
> 
>       [[alternative HTML version deleted]]
> 
> _______________________________________________
> R-sig-phylo mailing list
> R-sig-phylo@r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
> 

Mark Holder

mthol...@ku.edu
http://phylo.bio.ku.edu/mark-holder

==============================================
Department of Ecology and Evolutionary Biology
University of Kansas
6031 Haworth Hall
1200 Sunnyside Avenue
Lawrence, Kansas 66045

lab phone:  785.864.5789

fax (shared): 785.864.5860
==============================================

_______________________________________________
R-sig-phylo mailing list
R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo

Reply via email to