Thank you all for the help and for such an interesting discussion!

Karla

On Thu, Sep 5, 2019 at 1:08 PM Cecile Ane <cecile....@wisc.edu> wrote:

> Thanks Brian, great review, as always!
>
> To add one bit: this paper looks at the effective sample size that should
> be used for BIC, in the standard BM model (univariate).
> https://projecteuclid.org/download/pdfview_1/euclid.aoas/1223908053
>
> It gives a formula that depends on the tree shape and branch lengths. Like
> what Brian said: a pectinate tree would generally have a smaller effective
> sample size than a symmetric tree, for the same number of taxa. The general
> formula uses matrices, but the result should be something less than the
> number of taxa, and greater than: # branches stemming from the root * ratio
> (total tree height / length of shortest branch stemming from the root). The
> effective sample size should also be at least (total tree length / total
> tree height) for an ultrametric tree. See end of section 2 for an example
> of BIC penalties using effective sample sizes.
>
> The bottom line is the same as what Brian said:
> - it’s generally unknown what “sample size” should be used
> - in cases when we know, the answer is complicated (it depends on the tree
> and on the model).
>
> With multivariate data (multiple sites), the effective sample size for
> univariate data (like number of taxa or something smaller) should be
> multiplied by the number of sites, if the model assumes that sites are
> independent and share the same evolutionary parameters (consistent with
> what Brian said).
>
> Cécile
>
> On Sep 5, 2019, at 9:55 AM, Brian O'Meara <omeara.br...@gmail.com> wrote:
>
> Sample size is a weird thing in this area for AICc. For comparing DNA
> models in something like ModelTest, number of sites is used, but for OU/BM
> models, we typically use number of taxa. It's not resolved what's best.
>
> Posada and Buckley (2004, https://doi.org/10.1080/10635150490522304) have a
> discussion on this:
>
> Both in the AICc and the BIC descriptions above, the total number of
> characters was used as an estimate of sample size. However, effective
> sample sizes in phylogenetic studies are poorly understood, and depend on
> the quantity of interest (Churchill et al., 1992; Goldman, 1998; Morozov et
> al., 2000). Characters in an alignment will often not be independent, so
> using the total number of characters as a surrogate for sample size (Minin
> et al., 2003; Posada and Crandall, 2001b) could be an overestimate. Using
> only the number of variable sites as an estimate of sample size is a more
> conservative approach, but could be an underestimate (note that all sites
> are used when estimating base frequencies or the proportion of invariable
> sites). Indeed, sample size also depends on the number of taxa.
> Importantly, sample size can have an effect on the outcome of model
> selection with the AICc. In our example above, if we were to use the number
> of variable characters (301 sites) as the sample size, instead of the total
> number of characters (1927 sites), the best AICc model would not change,
> but the second and third AICc models would exchange their rankings.
> Furthermore, because the LRT, the AIC, and the BIC strategies rely on large
> sample asymptotics, it is also important to decide when a sample should be
> considered small. Although the AICc was derived under Gaussian assumptions,
> Burnham et al. (1994) found that this second order expression performed
> well in product multinomial models for open population capture-recapture.
> Burnham and Anderson (2003, p. 66) suggest using this correction when the
> sample size is small compared to the number of adjustable parameters, n/K <
> 40. Alternatively, and because AICc converges to the AIC with increasing
> n/K ratios, one could always use the AICc (D. Anderson, personal
> communications). Phylogenetic characters are mostly discrete, and the
> unconstrained model in phylogenetics is multinomial (Goldman, 1993). One
> may think of an alignment of nucleotide characters as a large and sparse
> contingency table with 4^T bins, where T is the number of taxa. For large
> sample asymptotics to hold in a contingency table every cell should
> contain, in general, more than 5 observations (see Agresti, 1990, p. 49,
> 244–250), which gives a rule of thumb of n/4^T > 5. Clearly, more research
> is needed on sample size in phylogenetics.
>
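
A minimal sketch of why the choice of n matters for AICc, using the two sample sizes from the quoted example (1927 total sites, 301 variable sites). The parameter counts K below are hypothetical, picked only for illustration; the correction term is the standard 2K(K+1)/(n-K-1).

```r
## AICc correction term added to AIC: 2K(K+1)/(n-K-1)
aicc_penalty <- function(K, n) 2 * K * (K + 1) / (n - K - 1)

K <- c(simple = 5, complex = 10)  # hypothetical parameter counts
n_all <- 1927                     # total characters (quoted example)
n_var <- 301                      # variable characters (quoted example)

## Rows: choice of sample size; columns: model
penalty <- rbind(
  all_sites      = aicc_penalty(K, n_all),
  variable_sites = aicc_penalty(K, n_var)
)
print(round(penalty, 3))

## Rule of thumb from the quote: apply the correction when n/K < 40
print(n_var / K < 40)
```

The smaller n penalizes the parameter-rich model more heavily, which is how closely ranked models can swap places.
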
> Beaulieu et al. (2018, https://doi.org/10.1093/molbev/msy222; note my COI
> as I'm an author on this) did some simulations on a codon model testing
> different ways of counting sample size (number of sites, number of taxa,
> number of sites * number of taxa, etc.) and found that number of cells in
> the matrix (number of sites * number of taxa) seemed to work best to
> approximate Kullback-Leibler distance. For univariate models like that used
> in brownie.lite, number of cells is equal to number of taxa (since there's
> only one column):
>
> We note our use of AICc, as calculated in Burnham and Anderson (2002, p.
> 66) and as opposed to the standard AIC, in the above model comparisons. At
> the outset of our study it was unclear what the appropriate sample size n
> is when comparing models of sequence evolution. Building upon the work of
> Jhwueng et al. (2014), our simulations suggest that using the number of
> taxa times the number of sites as the sample size correction performed best
> as a small sample size correction for estimating Kullback–Liebler (KL)
> distance in phylogenetic models (Supporting Materials). This also has an
> intuitive appeal. In models that have at least some parameters shared
> across sites and some parameters shared across taxa, increasing the number
> of sites and/or taxa should be adding more samples for the parameters to
> estimate. This is consistent considering how likelihood is calculated for
> phylogenetic models: the likelihood for a given site is the sum of the
> probabilities of each observed state at each tip, which is then multiplied
> across sites. It is arguable that the conventional approach in comparative
> methods is calculating AICc in the same way. That is, if only one column of
> data (or “site”) is examined, as remains remarkably common in comparative
> methods, when we refer to sample size, it is technically the number of taxa
> multiplied by number of sites, even though it is referred to simply as the
> number of taxa.
>
> I suspect this is still not a great approximation. Compare a balanced tree
> (every internal node having two descendants) with every internal branch
> length the same versus a pectinate (caterpillar) tree where the two edges
> connecting to the root node are very long and the other edges are all near
> zero. For the same number of taxa and same number of sites, I bet the first
> tree has more meaningful data: the pectinate tree with those branch lengths
> will likely have all but one of the taxa having nearly identical states. So
> I think tree shape and branch lengths should matter for this. I've done
> some preliminary analyses on this, building on Beaulieu et al. (2018) and
> Jhwueng et al. (2014, https://doi.org/10.1515/sagmb-2013-0048, also note
> COI), but nothing definitive yet.
>
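
A rough sketch of the tree-shape comparison above, in base R. It assumes Brownian motion, so the covariance between two tips is the branch length they share from the root, and it uses one common definition of effective sample size for the mean of correlated observations, 1' R^{-1} 1 with R the tip correlation matrix. That definition is an assumption of this sketch, not necessarily the formula in any paper cited in this thread. Both four-taxon trees and their branch lengths are made up for illustration.

```r
## Effective sample size for a mean of correlated observations:
## sum of the entries of the inverse correlation matrix.
n_eff <- function(V) {
  R <- cov2cor(V)
  sum(solve(R))
}

tips <- LETTERS[1:4]

## Balanced tree ((A,B),(C,D)), height 1, every branch of length 0.5.
V_bal <- matrix(c(1.0, 0.5, 0.0, 0.0,
                  0.5, 1.0, 0.0, 0.0,
                  0.0, 0.0, 1.0, 0.5,
                  0.0, 0.0, 0.5, 1.0), 4, 4,
                dimnames = list(tips, tips))

## Pectinate tree (((A,B),C),D), height 1: two long root edges
## (1.00 to D, 0.98 to the ABC clade), all other edges near zero.
V_pec <- matrix(c(1.00, 0.99, 0.98, 0.00,
                  0.99, 1.00, 0.98, 0.00,
                  0.98, 0.98, 1.00, 0.00,
                  0.00, 0.00, 0.00, 1.00), 4, 4,
                dimnames = list(tips, tips))

## About 2.67 for the balanced tree and about 2.01 for the pectinate one
print(c(balanced = n_eff(V_bal), pectinate = n_eff(V_pec)))
```

Under these assumptions the pectinate tree carries barely more information about the mean than two independent tips, even though both trees have four taxa.
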
> It's also worth looking at Ho and Ané (2014,
> https://doi.org/10.1111/2041-210X.12285) who talk about AIC in the context
> of OU shifts, but who get into sample size with shifts in a modified BIC
> that uses taxa in different regimes as sample size (but again, univariate,
> so maybe it's actually matrix size).
>
> I also probably am missing important work by others -- my apologies if so.
> If you know of any, please let me know (and probably Karla, too!).
>
> So, in summary.... yeah, what Liam said: number of taxa, but it might be
> more complex.
>
> Best,
> Brian
>
> _______________________________________________________________________
> Brian O'Meara, http://brianomeara.info, especially Calendar
> <http://brianomeara.info/calendar.html>, CV
> <http://brianomeara.info/cv.html>, and Feedback
> <http://brianomeara.info/feedback.html>
>
> Professor, Dept. of Ecology & Evolutionary Biology, UT Knoxville
> Associate Head, Dept. of Ecology & Evolutionary Biology, UT Knoxville
> He/Him/His
>
>
>
> On Thu, Sep 5, 2019 at 10:00 AM Liam Revell <liam.rev...@umb.edu> wrote:
>
> Dear Karla.
>
> In my opinion, it is probably correct to use the number of tips on the
> tree as the sample size for AICc when estimating the Brownian rate,
> because the number of independent pieces of information is n-1, just as
> with an ordinary variance. For other parameters in phylogenetic comparative
> analyses, however, the effective sample size may be different.
>
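
A minimal sketch of turning the two brownie.lite log-likelihoods into AICc scores with n set to the number of tips, as suggested above. The `fit` list is a stand-in with made-up numbers so the example runs without phytools; only the field names (logL1, k1, logL.multiple, k2) follow the logLik method quoted further down. The helper `aicc()` and the value of `n_tips` are assumptions of this sketch.

```r
## Small-sample corrected AIC: -2 logL + 2k + 2k(k+1)/(n-k-1)
aicc <- function(logL, k, n) {
  stopifnot(all(n - k - 1 > 0))
  -2 * logL + 2 * k + 2 * k * (k + 1) / (n - k - 1)
}

## Stand-in for a brownie.lite fit; the numbers are invented.
## A real fit would come from phytools::brownie.lite(tree, x).
fit <- list(logL1 = -45.2, k1 = 2, logL.multiple = -42.9, k2 = 3)
n_tips <- 30  # assumed; for a real tree use length(tree$tip.label)

scores <- aicc(
  logL = c("single-rate" = fit$logL1, "multi-rate" = fit$logL.multiple),
  k    = c(fit$k1, fit$k2),
  n    = n_tips
)
print(scores)

## Akaike weights, which can be used to model-average the rate estimates
delta   <- scores - min(scores)
weights <- exp(-delta / 2) / sum(exp(-delta / 2))
print(round(weights, 3))
```

If a different effective sample size turns out to be more appropriate, only `n_tips` needs to change.
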
> All the best, Liam
>
> Liam J. Revell
> Associate Professor, University of Massachusetts Boston
> Profesor Asistente, Universidad Católica de la Ssma Concepción
> web: http://faculty.umb.edu/liam.revell/, http://www.phytools.org
>
> Academic Director UMass Boston Chile Abroad (starting 2019):
> https://www.umb.edu/academics/caps/international/biology_chile
>
> On 9/5/2019 9:49 AM, Karla Shikev wrote:
>
> Thanks so much, Liam! Just one quick follow-up question: what do you
> suggest should be the sample size for transforming AIC into AICc? The
> number of tips on the tree?
>
> Karla
>
> On Thu, Sep 5, 2019 at 10:27 AM Liam Revell <liam.rev...@umb.edu> wrote:
>
> Dear Karla.
>
> You could try & create your own logLik method for the object class
> "brownie.lite" as follows:
>
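> ## load phytools, which provides brownie.lite()
> library(phytools)
> ## logLik method for objects of class "brownie.lite"
> logLik.brownie.lite<-function(object,...){
>         ## log-likelihoods of the single-rate and multi-rate models
>         lik<-setNames(
>                 c(object$logL1,object$logL.multiple),
>                 c("single-rate","multi-rate"))
>         ## number of estimated parameters in each model
>         attr(lik,"df")<-c(object$k1,object$k2)
>         lik
> }
> ## fit model (tree must have mapped regimes; x is a named trait vector)
> fit<-brownie.lite(tree,x)
> ## use it: AIC() picks up the new logLik method
> logLik(fit)
> AIC(fit)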
>
> All the best, Liam
>
> On 9/5/2019 9:13 AM, Karla Shikev wrote:
>
> Dear all,
>
> I've been trying to use brownie.lite to implement the tutorial available
> here (http://treethinkers.org/tutorials/morphological-evolution-in-r/) to
> calculate model-averaged rates of evolution and for model selection (1
> versus 2 rates). However, the current version of phytools (0.6-99) won't
> produce AICc estimates. Does anyone know a way around this? Any help would
> be greatly appreciated.
>
> thanks a bunch,
>
> Karla
>
> _______________________________________________
> R-sig-phylo mailing list - R-sig-phylo@r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
> Searchable archive at
> http://www.mail-archive.com/r-sig-phylo@r-project.org/
