Thank you all for the help and for such an interesting discussion!

Karla
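For readers following the AICc question in the quoted thread, here is a minimal sketch in base R of the small-sample correction, AICc = AIC + 2k(k+1)/(n-k-1), using the number of tips as n as Liam suggests. The `aicc` helper, the `fit` list, `n_tips`, and all numeric values are hypothetical stand-ins invented for illustration. Only the field names (`logL1`, `logL.multiple`, `k1`, `k2`) come from the logLik method quoted further down.

```r
## AICc from a log-likelihood, a parameter count k, and a sample size n.
## Assumes n > k + 1, otherwise the correction term is undefined.
aicc <- function(logL, k, n) {
  aic <- -2 * logL + 2 * k
  aic + (2 * k * (k + 1)) / (n - k - 1)
}

## Hypothetical stand-in for a brownie.lite fit (values are made up);
## the field names mirror those used in the logLik method in this thread.
fit <- list(logL1 = -50.0, k1 = 2, logL.multiple = -47.5, k2 = 3)
n_tips <- 30  # assumed number of tips on the tree

## AICc for the single-rate and multi-rate models
scores <- c(
  "single-rate" = aicc(fit$logL1, fit$k1, n_tips),
  "multi-rate"  = aicc(fit$logL.multiple, fit$k2, n_tips)
)

## Akaike weights, usable for model averaging
delta <- scores - min(scores)
weights <- exp(-delta / 2) / sum(exp(-delta / 2))

print(scores)
print(weights)
```

Swapping `n_tips` for a different sample size (for example, tips times sites) changes only the correction term, which is why the choice of n debated below can reorder closely ranked models.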
On Thu, Sep 5, 2019 at 1:08 PM Cecile Ane <cecile....@wisc.edu> wrote:

> Thanks Brian, great review, as always!
>
> To add one bit: this paper looks at the effective sample size that should
> be used for BIC, in the standard BM model (univariate).
> https://projecteuclid.org/download/pdfview_1/euclid.aoas/1223908053
>
> It gives a formula that depends on the tree shape and branch lengths. Like
> what Brian said: a pectinate tree would generally have a smaller effective
> sample size than a symmetric tree, for the same number of taxa. The general
> formula uses a matrix, but the result should be something less than the
> number of taxa, and greater than: # branches stemming from the root * ratio
> (total tree height / length of shortest branch stemming from the root). The
> effective sample size should also be at least (total tree length / total
> tree height) for an ultrametric tree. See end of section 2 for an example
> of BIC penalties using effective sample sizes.
>
> The bottom line is the same as what Brian said:
> - it’s generally unknown what “sample size” should be used
> - in cases when we know, the answer is complicated (it depends on the tree
>   and on the model).
>
> With multivariate data (multiple sites), the effective sample size for
> univariate data (like number of taxa or something smaller) should be
> multiplied by the number of sites, if the model assumes that sites are
> independent and share the same evolutionary parameters (consistent with
> what Brian said).
>
> Cécile
>
> On Sep 5, 2019, at 9:55 AM, Brian O'Meara <omeara.br...@gmail.com> wrote:
>
> Sample size is a weird thing in this area for AICc. For comparing DNA
> models in something like ModelTest, number of sites is used, but for OU/BM
> models, we typically use number of taxa. It's not resolved what's best.
>
> Posada and Buckley (2004, https://doi.org/10.1080/10635150490522304) have
> a discussion on this:
>
> Both in the AICc and the BIC descriptions above, the total number of
> characters was used as an estimate of sample size. However, effective
> sample sizes in phylogenetic studies are poorly understood, and depend on
> the quantity of interest (Churchill et al., 1992; Goldman, 1998; Morozov et
> al., 2000). Characters in an alignment will often not be independent, so
> using the total number of characters as a surrogate for sample size (Minin
> et al., 2003; Posada and Crandall, 2001b) could be an overestimate. Using
> only the number of variable sites as an estimate of sample size is a more
> conservative approach, but could be an underestimate (note that all sites
> are used when estimating base frequencies or the proportion of invariable
> sites). Indeed, sample size also depends on the number of taxa.
> Importantly, sample size can have an effect on the outcome of model
> selection with the AICc. In our example above, if we were to use the number
> of variable characters (301 sites) as the sample size, instead of the total
> number of characters (1927 sites), the best AICc model would not change,
> but the second and third AICc models would exchange their rankings.
> Furthermore, because the LRT, the AIC, and the BIC strategies rely on large
> sample asymptotics, it is also important to decide when a sample should be
> considered small. Although the AICc was derived under Gaussian assumptions,
> Burnham et al. (1994) found that this second order expression performed
> well in product multinomial models for open population capture-recapture.
> Burnham and Anderson (2003, p. 66) suggest using this correction when the
> sample size is small compared to the number of adjustable parameters,
> n/K < 40. Alternatively, and because AICc converges to the AIC with
> increasing n/K ratios, one could always use the AICc (D. Anderson, personal
> communications).
> Phylogenetic characters are mostly discrete, and the
> unconstrained model in phylogenetics is multinomial (Goldman, 1993). One
> may think of an alignment of nucleotide characters as a large and sparse
> contingency table with 4^T bins, where T is the number of taxa. For large
> sample asymptotics to hold in a contingency table every cell should
> contain, in general, more than 5 observations (see Agresti, 1990, p. 49,
> 244–250), which gives a rule of thumb of n/4^T > 5. Clearly, more research
> is needed on sample size in phylogenetics.
>
> Beaulieu et al. (2018, https://doi.org/10.1093/molbev/msy222; note my COI
> as I'm an author on this) did some simulations on a codon model testing
> different ways of counting sample size (number of sites, number of taxa,
> number of sites * number of taxa, etc.) and found that number of cells in
> the matrix (number of sites * number of taxa) seemed to work best to
> approximate Kullback-Leibler distance. For univariate models like that used
> in brownie.lite, number of cells is equal to number of taxa (since there's
> only one column):
>
> We note our use of AICc, as calculated in Burnham and Anderson (2002, p.
> 66) and as opposed to the standard AIC, in the above model comparisons. At
> the outset of our study it was unclear what the appropriate sample size n
> is when comparing models of sequence evolution. Building upon the work of
> Jhwueng et al. (2014), our simulations suggest that using the number of
> taxa times the number of sites as the sample size correction performed best
> as a small sample size correction for estimating Kullback–Leibler (KL)
> distance in phylogenetic models (Supporting Materials). This also has an
> intuitive appeal. In models that have at least some parameters shared
> across sites and some parameters shared across taxa, increasing the number
> of sites and/or taxa should be adding more samples for the parameters to
> estimate.
> This is consistent considering how likelihood is calculated for
> phylogenetic models: the likelihood for a given site is the sum of the
> probabilities of each observed state at each tip, which is then multiplied
> across sites. It is arguable that the conventional approach in comparative
> methods is calculating AICc in the same way. That is, if only one column of
> data (or “site”) is examined, as remains remarkably common in comparative
> methods, when we refer to sample size, it is technically the number of taxa
> multiplied by number of sites, even though it is referred to simply as the
> number of taxa.
>
> I suspect this is still not a great approximation. Compare a balanced tree
> (every internal node having two descendants) with every internal branch
> length the same versus a pectinate (caterpillar) tree where the two edges
> connecting to the root node are very long and the other edges are all near
> zero. For the same number of taxa and same number of sites, I bet the first
> tree has more meaningful data: the pectinate tree with those branch lengths
> will likely have all but one of the taxa having nearly identical states. So
> I think tree shape and branch lengths should matter for this. I've done
> some preliminary analyses on this, building on Beaulieu et al. (2018) and
> Jhwueng et al. (2014, https://doi.org/10.1515/sagmb-2013-0048, also note
> COI), but nothing definitive yet.
>
> It's also worth looking at Ho and Ané (2014,
> https://doi.org/10.1111/2041-210X.12285) who talk about AIC in the context
> of OU shifts, but who get into sample size with shifts in a modified BIC
> that uses taxa in different regimes as sample size (but again, univariate,
> so maybe it's actually matrix size).
>
> I also probably am missing important work by others -- my apologies if so.
> If you know of any, please let me know (and probably Karla, too!).
>
> So, in summary.... yeah, what Liam said: number of taxa, but it might be
> more complex.
>
> Best,
> Brian
>
> _______________________________________________________________________
> Brian O'Meara, http://brianomeara.info, especially Calendar
> <http://brianomeara.info/calendar.html>, CV
> <http://brianomeara.info/cv.html>, and Feedback
> <http://brianomeara.info/feedback.html>
>
> Professor, Dept. of Ecology & Evolutionary Biology, UT Knoxville
> Associate Head, Dept. of Ecology & Evolutionary Biology, UT Knoxville
> He/Him/His
>
> On Thu, Sep 5, 2019 at 10:00 AM Liam Revell <liam.rev...@umb.edu> wrote:
>
> Dear Karla.
>
> In my opinion, it is probably correct to use the number of tips on the
> tree as the sample size for AICc when estimating the Brownian rate: as
> the number of independent pieces of information is n-1, just like with
> an ordinary variance. For other parameters in phylogenetic comparative
> analyses, the effective sample size may be different, however.
>
> All the best, Liam
>
> Liam J. Revell
> Associate Professor, University of Massachusetts Boston
> Profesor Asistente, Universidad Católica de la Ssma Concepción
> web: http://faculty.umb.edu/liam.revell/, http://www.phytools.org
>
> Academic Director UMass Boston Chile Abroad (starting 2019):
> https://www.umb.edu/academics/caps/international/biology_chile
>
> On 9/5/2019 9:49 AM, Karla Shikev wrote:
>
> Thanks so much, Liam! Just one quick follow-up question: what do you
> suggest should be the sample size for transforming AIC into AICc? The
> number of tips on the tree?
>
> Karla
>
> On Thu, Sep 5, 2019 at 10:27 AM Liam Revell <liam.rev...@umb.edu> wrote:
>
> Dear Karla.
>
> You could try & create your own logLik method for the object class
> "brownie.lite" as follows:
>
> ## method: expose both brownie.lite log-likelihoods to logLik() and AIC()
> logLik.brownie.lite<-function(object,...){
>     ## log-likelihoods of the single-rate and multi-rate models
>     lik<-setNames(
>         c(object$logL1,object$logL.multiple),
>         c("single-rate","multi-rate"))
>     ## number of parameters in each model, stored as the "df" attribute
>     attr(lik,"df")<-c(object$k1,object$k2)
>     lik
> }
> ## fit model (brownie.lite is in phytools)
> library(phytools)
> fit<-brownie.lite(tree,x)
> ## use it
> logLik(fit)
> AIC(fit)
>
> All the best, Liam
>
> Liam J. Revell
> Associate Professor, University of Massachusetts Boston
> Profesor Asistente, Universidad Católica de la Ssma Concepción
> web: http://faculty.umb.edu/liam.revell/, http://www.phytools.org
>
> Academic Director UMass Boston Chile Abroad (starting 2019):
> https://www.umb.edu/academics/caps/international/biology_chile
>
> On 9/5/2019 9:13 AM, Karla Shikev wrote:
>
> Dear all,
>
> I've been trying to use brownie.lite to implement the tutorial available
> here (http://treethinkers.org/tutorials/morphological-evolution-in-r/) to
> calculate model-averaged rates of evolution and for model selection (1
> versus 2 rates). However, the current version of phytools 0.6-99 won't
> produce AICc estimates. Does anyone know a way around this? Any help would
> be greatly appreciated.
>
> thanks a bunch,
>
> Karla
>
> _______________________________________________
> R-sig-phylo mailing list - R-sig-phylo@r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
> Searchable archive at
> http://www.mail-archive.com/r-sig-phylo@r-project.org/