Re: [R-sig-phylo] model averaging using brownie.lite

Brian O'Meara Thu, 05 Sep 2019 08:01:16 -0700

Sample size is a weird thing in this area for AICc. For comparing DNA
models in something like ModelTest, number of sites is used, but for OU/BM
models, we typically use number of taxa. It's not resolved what's best.

Posada and Buckley (2004, https://doi.org/10.1080/10635150490522304) have a
discussion on this:

Both in the AICc and the BIC descriptions above, the total number of
characters was used as an estimate of sample size. However, effective
sample sizes in phylogenetic studies are poorly understood, and depend on
the quantity of interest (Churchill et al., 1992; Goldman, 1998; Morozov et
al., 2000). Characters in an alignment will often not be independent, so
using the total number of characters as a surrogate for sample size (Minin
et al., 2003; Posada and Crandall, 2001b) could be an overestimate. Using
only the number of variable sites as an estimate of sample size is a more
conservative approach, but could be an underestimate (note that all sites
are used when estimating base frequencies or the proportion of invariable
sites). Indeed, sample size also depends on the number of taxa.
Importantly, sample size can have an effect on the outcome of model
selection with the AICc. In our example above, if we were to use the number
of variable characters (301 sites) as the sample size, instead of the total
number of characters (1927 sites), the best AICc model would not change,
but the second and third AICc models would exchange their rankings.
Furthermore, because the LRT, the AIC, and the BIC strategies rely on large
sample asymptotics, it is also important to decide when a sample should be
considered small. Although the AICc was derived under Gaussian assumptions,
Burnham et al. (1994) found that this second order expression performed
well in product multinomial models for open population capture-recapture.
Burnham and Anderson (2003, p. 66) suggest using this correction when the
sample size is small compared to the number of adjustable parameters, n/K <
40. Alternatively, and because AICc converges to the AIC with increasing
n/K ratios, one could always use the AICc (D. Anderson, personal
communications). Phylogenetic characters are mostly discrete, and the
unconstrained model in phylogenetics is multinomial (Goldman, 1993). One
may think of an alignment of nucleotide characters as a large and sparse
contingency table with 4^T bins, where T is the number of taxa. For large
sample asymptotics to hold in a contingency table every cell should
contain, in general, more than 5 observations (see Agresti, 1990, p. 49,
244–250), which gives a rule of thumb of n/4^T > 5. Clearly, more research
is needed on sample size in phylogenetics.
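
To make the dependence on n concrete, here is a rough base-R sketch of the usual AICc formula scored under two candidate sample sizes. The log-likelihoods, parameter counts, and number of taxa are made-up placeholders (not values from Posada and Buckley); only the 301 and 1927 site counts come from the passage above.

```r
# AICc from a log-likelihood, parameter count K, and an assumed sample size n
aicc <- function(logL, K, n) {
  stopifnot(all(n - K - 1 > 0))
  -2 * logL + 2 * K + (2 * K * (K + 1)) / (n - K - 1)
}

# Hypothetical fits for three models (placeholder numbers)
logL <- c(m1 = -5000.0, m2 = -4990.0, m3 = -4989.5)
K    <- c(m1 = 40, m2 = 45, m3 = 46)

# Same fits, scored with two different choices of sample size
aicc_all <- aicc(logL, K, n = 1927)  # all sites
aicc_var <- aicc(logL, K, n = 301)   # variable sites only

# Compare the rankings under each choice
rank(aicc_all)
rank(aicc_var)

# The two rules of thumb from the passage, for a hypothetical 10-taxon data set
n_taxa <- 10
small_by_nK    <- 1927 / max(K) < 40      # n/K < 40: use the correction
asymptotics_ok <- 1927 / 4^n_taxa > 5     # n/4^T > 5
```

Whether the ranking flips depends entirely on the inputs, so it is worth running both choices of n on your own fits.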

Beaulieu et al. (2018, https://doi.org/10.1093/molbev/msy222; note my COI
as I'm an author on this) did some simulations on a codon model testing
different ways of counting sample size (number of sites, number of taxa,
number of sites * number of taxa, etc.) and found that number of cells in
the matrix (number of sites * number of taxa) seemed to work best to
approximate Kullback-Leibler distance. For univariate models like that used
in brownie.lite, number of cells is equal to number of taxa (since there's
only one column):

We note our use of AICc, as calculated in Burnham and Anderson (2002, p.
66) and as opposed to the standard AIC, in the above model comparisons. At
the outset of our study it was unclear what the appropriate sample size n
is when comparing models of sequence evolution. Building upon the work of
Jhwueng et al. (2014), our simulations suggest that using the number of
taxa times the number of sites as the sample size correction performed best
as a small sample size correction for estimating Kullback–Liebler (KL)
distance in phylogenetic models (Supporting Materials). This also has an
intuitive appeal. In models that have at least some parameters shared
across sites and some parameters shared across taxa, increasing the number
of sites and/or taxa should be adding more samples for the parameters to
estimate. This is consistent considering how likelihood is calculated for
phylogenetic models: the likelihood for a given site is the sum of the
probabilities of each observed state at each tip, which is then multiplied
across sites. It is arguable that the conventional approach in comparative
methods is calculating AICc in the same way. That is, if only one column of
data (or “site”) is examined, as remains remarkably common in comparative
methods, when we refer to sample size, it is technically the number of taxa
multiplied by number of sites, even though it is referred to simply as the
number of taxa.
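
For a brownie.lite-style comparison, that bookkeeping might look like the sketch below. The `fit` list is only a stand-in carrying the four fields used by the logLik method quoted further down in this thread, with made-up values, and `n_taxa` is a hypothetical tip count.

```r
# Stand-in for a brownie.lite fit (made-up values, not real output)
fit <- list(logL1 = -52.3, k1 = 2, logL.multiple = -49.8, k2 = 3)

n_taxa  <- 40               # hypothetical number of tips
n_sites <- 1                # a single trait column
n <- n_taxa * n_sites       # cells in the data matrix

aicc <- function(logL, K, n) -2 * logL + 2 * K + 2 * K * (K + 1) / (n - K - 1)

aicc_fit <- c(
  "single-rate" = aicc(fit$logL1, fit$k1, n),
  "multi-rate"  = aicc(fit$logL.multiple, fit$k2, n)
)
aicc_fit
```

With one column, n collapses to the number of taxa, which is the conventional choice.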

I suspect this is still not a great approximation. Compare a balanced tree
(every internal node splitting its descendant tips evenly) with every internal branch
length the same versus a pectinate (caterpillar) tree where the two edges
connecting to the root node are very long and the other edges are all near
zero. For the same number of taxa and same number of sites, I bet the first
tree has more meaningful data: the pectinate tree with those branch lengths
will likely have all but one of the taxa having nearly identical states. So
I think tree shape and branch lengths should matter for this. I've done
some preliminary analyses on this, building on Beaulieu et al. (2018) and
Jhwueng et al. (2014, https://doi.org/10.1515/sagmb-2013-0048, also note
COI), but nothing definitive yet.
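
One way to see the intuition (a sketch only, not the preliminary analyses mentioned above) is to compare the two tree shapes with a standard effective sample size for the mean of equal-variance correlated observations, n_e = 1' R^{-1} 1, where R is the correlation matrix among tips implied by Brownian motion. Using this particular measure as a stand-in for "meaningful data" is my assumption; the four-taxon trees and branch lengths are made up.

```r
# Effective sample size for the mean: sum of all entries of R^{-1}
n_eff <- function(C) {
  R <- cov2cor(C)   # BM covariance -> correlation among tips
  sum(solve(R))
}

tips <- paste0("t", 1:4)

# Balanced tree ((t1,t2),(t3,t4)), every branch of length 1.
# Under BM, C[i,j] is the path length shared by tips i and j from the root.
C_bal <- matrix(c(2, 1, 0, 0,
                  1, 2, 0, 0,
                  0, 0, 2, 1,
                  0, 0, 1, 2),
                4, 4, byrow = TRUE, dimnames = list(tips, tips))

# Pectinate tree (t1,(t2,(t3,t4))): the two root edges are long (about L),
# every other edge is near zero (e)
L <- 10
e <- 0.01
C_pec <- matrix(c(L + 2 * e, 0,         0,         0,
                  0,         L + 2 * e, L,         L,
                  0,         L,         L + 2 * e, L + e,
                  0,         L,         L + e,     L + 2 * e),
                4, 4, byrow = TRUE, dimnames = list(tips, tips))

ne_bal <- n_eff(C_bal)
ne_pec <- n_eff(C_pec)
c(balanced = ne_bal, pectinate = ne_pec)
```

Both trees have four tips, but the pectinate one behaves like roughly two independent observations under this measure, while the balanced one is worth noticeably more.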

It's also worth looking at Ho and Ané (2014,
https://doi.org/10.1111/2041-210X.12285) who talk about AIC in the context
of OU shifts, but who get into sample size with shifts in a modified BIC
that uses taxa in different regimes as sample size (but again, univariate,
so maybe it's actually matrix size).

I also probably am missing important work by others -- my apologies if so.
If you know of any, please let me know (and probably Karla, too!).

So, in summary... yeah, what Liam said: number of taxa, but it might be
more complex.
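
Since the original question was about model averaging, here is a minimal sketch of Akaike weights and model-averaged rates using number of taxa as n. The `fit` list is a stand-in with made-up values; the field names mirror what brownie.lite returns (logL1, k1, sig2.single, logL.multiple, k2, sig2.multiple), and the regime names "a" and "b" and the tip count are hypothetical.

```r
# Stand-in for a two-regime brownie.lite fit (made-up values)
fit <- list(logL1 = -52.3, k1 = 2, sig2.single = 0.8,
            logL.multiple = -49.8, k2 = 3,
            sig2.multiple = c(a = 0.5, b = 1.6))
n <- 40  # hypothetical number of tips

aicc <- function(logL, K, n) -2 * logL + 2 * K + 2 * K * (K + 1) / (n - K - 1)

scores <- c(single = aicc(fit$logL1, fit$k1, n),
            multi  = aicc(fit$logL.multiple, fit$k2, n))

# Akaike weights from AICc differences
delta <- scores - min(scores)
w <- exp(-delta / 2) / sum(exp(-delta / 2))

# Model-averaged rate per regime: the single-rate model contributes the
# same rate to every regime, the multi-rate model its regime-specific rate
avg_rates <- w[["single"]] * fit$sig2.single + w[["multi"]] * fit$sig2.multiple

w
avg_rates
```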

Best,
Brian

_______________________________________________________________________
Brian O'Meara, http://brianomeara.info, especially Calendar
<http://brianomeara.info/calendar.html>, CV
<http://brianomeara.info/cv.html>, and Feedback
<http://brianomeara.info/feedback.html>

Professor, Dept. of Ecology & Evolutionary Biology, UT Knoxville
Associate Head, Dept. of Ecology & Evolutionary Biology, UT Knoxville
He/Him/His

On Thu, Sep 5, 2019 at 10:00 AM Liam Revell <liam.rev...@umb.edu> wrote:

> Dear Karla.
>
> In my opinion, it is probably correct to use the number of tips on the
> tree as the sample size for AICc when estimating the Brownian rate, since
> the number of independent pieces of information is n-1, just as with
> an ordinary variance. For other parameters in phylogenetic comparative
> analyses, however, the effective sample size may be different.
>
> All the best, Liam
>
> Liam J. Revell
> Associate Professor, University of Massachusetts Boston
> Profesor Asistente, Universidad Católica de la Ssma Concepción
> web: http://faculty.umb.edu/liam.revell/, http://www.phytools.org
>
> Academic Director UMass Boston Chile Abroad (starting 2019):
> https://www.umb.edu/academics/caps/international/biology_chile
>
> On 9/5/2019 9:49 AM, Karla Shikev wrote:
> > Thanks so much, Liam! Just one quick follow-up question: what do you
> > suggest should be the sample size for converting AIC to AICc? The
> > number of tips on the tree?
> >
> > Karla
> >
> > On Thu, Sep 5, 2019 at 10:27 AM Liam Revell <liam.rev...@umb.edu> wrote:
> >
> >> Dear Karla.
> >>
> >> You could try & create your own logLik method for the object class
> >> "brownie.lite" as follows:
> >>
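> >> ## S3 logLik method for "brownie.lite" fits (needs phytools loaded)
> >> library(phytools)
> >> logLik.brownie.lite<-function(object,...){
> >>   ## log-likelihoods of the single-rate and multi-rate models
> >>   lik<-setNames(
> >>     c(object$logL1,object$logL.multiple),
> >>     c("single-rate","multi-rate"))
> >>   ## parameter counts, which AIC() reads from the "df" attribute
> >>   attr(lik,"df")<-c(object$k1,object$k2)
> >>   lik
> >> }
> >> ## fit model (tree: tree with mapped regimes; x: named trait vector)
> >> fit<-brownie.lite(tree,x)
> >> ## use it
> >> logLik(fit)
> >> AIC(fit)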
> >>
> >> All the best, Liam
> >>
> >> Liam J. Revell
> >> Associate Professor, University of Massachusetts Boston
> >> Profesor Asistente, Universidad Católica de la Ssma Concepción
> >> web: http://faculty.umb.edu/liam.revell/, http://www.phytools.org
> >>
> >> Academic Director UMass Boston Chile Abroad (starting 2019):
> >> https://www.umb.edu/academics/caps/international/biology_chile
> >>
> >> On 9/5/2019 9:13 AM, Karla Shikev wrote:
> >>> Dear all,
> >>>
> >>> I've been trying to use brownie.lite to implement the tutorial available
> >>> here (http://treethinkers.org/tutorials/morphological-evolution-in-r/) to
> >>> calculate model-averaged rates of evolution and for model selection (1
> >>> versus 2 rates). However, the current version of phytools 0.6-99 won't
> >>> produce AICc estimates. Does anyone know a way around this? Any help would
> >>> be greatly appreciated.
> >>>
> >>> thanks a bunch,
> >>>
> >>> Karla
> >>>
> >>> _______________________________________________
> >>> R-sig-phylo mailing list - R-sig-phylo@r-project.org
> >>> https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
> >>> Searchable archive at
> >>> http://www.mail-archive.com/r-sig-phylo@r-project.org/
> >>>
> >>
> >
> _______________________________________________
> R-sig-phylo mailing list - R-sig-phylo@r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
> Searchable archive at
> http://www.mail-archive.com/r-sig-phylo@r-project.org/
>

_______________________________________________
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/
