Emmanuel and Klaus, After Klaus reminded me that `which` was so inefficient I realized using it in finding the match in the first lineage is unnecessary. Instead, we can capitalize on the fact that shallower nodes have smaller nodeids. So I threw out the vector index `mrcaind` altogether:
getMRCA_new <- function(phy, tip) { if (!inherits(phy, "phylo")) stop('object "phy" is not of class "phylo"') if (length(tip) < 2) return(NULL) Ntip <- length(phy$tip.label) rootnd <- Ntip + 1L pars <- integer(phy$Nnode) # worst case assignment, usually far too long tnd <- if (is.character(tip)) match(tip, phy$tip.label) else tip ## build a lookup table to get parents faster pvec <- integer(Ntip + phy$Nnode) pvec[phy$edge[, 2]] <- phy$edge[, 1] ## get entire lineage for first tip nd <- tnd[1] for (k in 1:phy$Nnode) { nd <- pvec[nd] pars[k] <- nd if (nd == rootnd) break } pars <- pars[1:k] # delete the rest mrcand <- pars[1]; # keep track of actual node rather than index in pars ## traverse lineages for remaining tips, stop if hit common ancestor for (i in 2:length(tnd)) { done <- FALSE cnd <- tnd[i] while (!done) { cpar <- pvec[cnd] # get immediate parent if (cpar %in% pars) { if (cpar == rootnd) return(rootnd) # early exit mrcand <- min(mrcand, cpar); # shallowest node has lowest nodeid done <- TRUE } cnd <- cpar # keep going! } } mrcand } I also played around with compiling the function: library(compiler); getMRCA_em_cmp <- cmpfun(getMRCA_em); getMRCA_new_cmp <- cmpfun(getMRCA_new); Here are some timings (*_em is your code, *_new is the code above). For my original example (100K tips, 500 tips in MRCA query) there is not a whole lot of difference: microbenchmark(getMRCA_em(phy, tips), getMRCA_new(phy, tips), getMRCA_em_cmp(phy, tips), getMRCA_new_cmp(phy, tips)); Unit: milliseconds expr min lq mean median uq max neval getMRCA_em(phy, tips) 23.99879 29.81964 33.67437 32.39149 35.61888 136.34521 100 getMRCA_new(phy, tips) 23.76941 28.02500 31.84285 31.37001 34.89640 44.66380 100 getMRCA_em_cmp(phy, tips) 20.81924 23.60118 28.23957 28.25202 30.18535 59.72102 100 getMRCA_new_cmp(phy, tips) 20.16588 23.19571 27.19610 26.65783 29.82149 42.61702 100 However, for the worst-case scenario that Klaus devised the results are very impressive: microbenchmark(getMRCA_em(tree, c(1,2)), getMRCA_new(tree, c(1,2)), getMRCA_em_cmp(tree, c(1,2)), getMRCA_new_cmp(tree, c(1,2))); Unit: milliseconds expr min lq mean median uq max neval getMRCA_em(tree, c(1, 2)) 124.79040 140.44473 146.98292 145.39224 150.96044 286.78335 100 getMRCA_new(tree, c(1, 2)) 123.39924 139.41609 145.21311 143.92062 148.34205 297.73846 100 getMRCA_em_cmp(tree, c(1, 2)) 23.72105 26.84944 29.94950 29.87935 32.26490 41.99264 100 getMRCA_new_cmp(tree, c(1, 2)) 23.15223 25.74686 29.21526 29.25370 30.80915 69.68522 100 So it might be worth compiling. HTH. JWB ________________________________________ Joseph W. Brown Post-doctoral Researcher, Smith Laboratory University of Michigan Department of Ecology & Evolutionary Biology Room 2071, Kraus Natural Sciences Building Ann Arbor MI 48109-1079 josep...@umich.edu > On 10 Jun, 2017, at 05:26, Emmanuel Paradis <emmanuel.para...@ird.fr> wrote: > > Joseph, Klaus, > > This is great. I slightly edited the code and renamed the argument 'tips' to > 'tip' (as in the original definition). I run on a bunch of random trees and > the results are exactly identical with the current version of getMRCA. I'm > changing the code in ape now. > > Cheers, > > Emmanuel > > getMRCA <- function(phy, tip) > { > if (!inherits(phy, "phylo")) > stop('object "phy" is not of class "phylo"') > if (length(tip) < 2) return(NULL) > Ntip <- length(phy$tip.label) > rootnd <- Ntip + 1L > > pars <- integer(phy$Nnode) # worst case assignment, usually far too long > mrcaind <- 1L # index in pars of the mrca node. will return highest one > traversed by other lineages > tnd <- if (is.character(tip)) match(tip, phy$tip.label) else tip > > ## build a lookup table to get parents faster > pvec <- integer(Ntip + phy$Nnode) > pvec[phy$edge[, 2]] <- phy$edge[, 1] > > ## get entire lineage for first tip > nd <- tnd[1] > for (k in 1:phy$Nnode) { > nd <- pvec[nd] > pars[k] <- nd > if (nd == rootnd) break > } > pars <- pars[1:k] # delete the rest > > ## traverse lineages for remaining tips, stop if hit common ancestor > for (i in 2:length(tnd)) { > done <- FALSE > cnd <- tnd[i] > while (!done) { > cpar <- pvec[cnd] # get immediate parent > if (cpar %in% pars) { > ## early exit! if the mrca of any set of taxa is the root then > we are done > if (cpar == rootnd) return(rootnd) > idx <- which(pars == cpar) > if (idx > mrcaind) mrcaind <- idx > done <- TRUE > } else cnd <- cpar # keep going! > } > } > pars[mrcaind] > } > > > Le 10/06/2017 à 00:28, Klaus Schliep a écrit : >> Hi Joseph >> I guess I figured out where some room for improvement was. >> You grow the vector pars <- c(pars, nd) >> I preallocated pars, but you could even just grow pars using pars[k] <- nd >> without pre-allocation. This operation is now (with R >= 3.4) far less >> costly as mentioned in the NEWS: >> "Assigning to an element of a vector beyond the current length now >> over-allocates by a small fraction. The new vector is marked internally as >> growable, and the true length of the new vector is stored in the truelength >> field. This makes building up a vector result by assigning to the next >> element beyond the current length more efficient, though pre-allocating is >> still preferred. The implementation is subject to change and not intended >> to be used in packages at this time." >> One of the many little improvements recently making R quite a bit faster >> and easier. >> So the worst case behavior was pretty bad >> tree <- stree(100000, "right") >> system.time(get_mrca_fast(tree, c(1,2))) >> user system elapsed >> 13.424 0.012 13.440 >> mrca >> [1] 199999 >> with the changes it is much better: >> system.time(mrca <- get_mrca_fast(tree, c(1,2))) >> user system elapsed >> 0.012 0.000 0.012 >>> mrca >> [1] 199999 >> And here is the code: >> get_mrca_fast <- function (phy, tips) { >> rootnd <- length(phy$tip.label) + 1; >> pars <- integer(phy$Nnode); # worst case assignment, usually far too >> long >> mrcaind <- 1; # index in pars of the mrca node. will return >> highest one traversed by other lineages >> tnd <- NULL; # node ids of tips >> if ("character" %in% class(tips)) { >> tnd <- match(tips, phy$tip.label); >> } else { >> tnd <- tips; >> } >> # build a lookup table to get parents faster >> pvec <- integer(max(phy$edge)); >> pvec[phy$edge[,2]] <- phy$edge[,1]; >> # get entire lineage for first tip >> nd <- tnd[1]; >> for(k in 1:phy$Nnode){ >> nd <- pvec[nd] >> pars[k] <- nd >> if (nd == rootnd) break >> } >> pars <- pars[1:k] #delete the rest >> # traverse lineages for remaining tips, stop if hit common ancestor >> for (i in 2:length(tnd)) { >> done <- FALSE; >> cnd <- tnd[i]; >> while (!done) { >> cpar <- pvec[cnd]; # get immediate parent >> if (cpar %in% pars) { >> if (cpar == rootnd) { >> # early exit! if the mrca of any set of taxa is the >> root then we are done >> return(rootnd); >> } >> idx <- which(pars == cpar); >> if (idx > mrcaind) { >> mrcaind <- idx; >> } >> done <- TRUE; >> } else { >> # keep going! >> cnd <- cpar; >> } >> } >> } >> return(pars[mrcaind]); >> } >> Have a nice weekend, >> Klaus >> On Fri, Jun 9, 2017 at 3:59 PM, Joseph W. Brown <josep...@umich.edu> wrote: >>> Liam, >>> >>> I put the (updated) code up as a gist <https://gist.github.com/josephwb/ >>> ad61fd29ed4fb06e712e23d67422c813>. Feel free to use/improve/whatever. >>> Emmanuel sees room for improvement but frankly taking 0.3 seconds to find >>> the MRCA of 5000 taxa on a million-tip tree is good enough for my purposes. >>> >>> JWB >>> ________________________________________ >>> Joseph W. Brown >>> Post-doctoral Researcher, Smith Laboratory >>> University of Michigan >>> Department of Ecology & Evolutionary Biology >>> Room 2071, Kraus Natural Sciences Building >>> Ann Arbor MI 48109-1079 >>> josep...@umich.edu >>> >>> >>> >>>> On 9 Jun, 2017, at 15:20, Liam J. Revell <liam.rev...@umb.edu> wrote: >>>> >>>> On the other hand, phytools does have a function - the somewhat >>> imprecisely named fastMRCA - which can find the MRCA of just a pair of >>> species much faster than getMRCA (however still slower than or only about >>> as fast as Joseph & Klaus's solutions). >>>> >>>> Liam J. Revell, Associate Professor of Biology >>>> University of Massachusetts Boston >>>> web: http://faculty.umb.edu/liam.revell/ >>>> email: liam.rev...@umb.edu >>>> blog: http://blog.phytools.org >>>> >>>> On 6/9/2017 5:22 AM, Liam J. Revell wrote: >>>>> >>>>> Juan. findMRCA was written before getMRCA existed, but the latter was >>>>> faster so now it just calls getMRCA internally. All the best, Liam >>>>> >>>>> Liam J. Revell, Associate Professor of Biology >>>>> University of Massachusetts Boston >>>>> web: http://faculty.umb.edu/liam.revell/ >>>>> email: liam.rev...@umb.edu >>>>> blog: http://blog.phytools.org >>>>> >>>>> On 6/9/2017 1:57 AM, Juan Antonio Balbuena wrote: >>>>>> Package phytools includes a function, findMRCA, that is supposed to >>> work >>>>>> very efficiently with large trees. you may wish to compare it with your >>>>>> function. >>>>>> >>>>>> Cheers >>>>>> >>>>>> Juan >>>>>> >>>>>> -- >>>>>> >>>>>> Dr. Juan A. Balbuena >>>>>> Cavanilles Institute of Biodiversity and Evolutionary Biology >>>>>> University of Valencia >>>>>> http://www.uv.es/~balbuena <http://www.uv.es/%7Ebalbuena> >>>>>> P.O. Box 22085 >>>>>> http://www.uv.es/cophylpaco >>>>>> <http://www.uv.es/cavanilles/zoomarin/index.htm> >>>>>> 46071 Valencia, Spain >>>>>> e-mail: j.a.balbu...@uv.es <mailto:j.a.balbu...@uv.es> tel. +34 963 >>>>>> 543 658 fax +34 963 543 733 >>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ >>>>>> *NOTE!*For shipments by EXPRESS COURIER use the following street >>> address: >>>>>> C/ Catedrático José Beltrán 2, 46980 Paterna (Valencia), Spain. >>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ >>>>>> >>>>>> >>>>>> <https://www.avast.com/sig-email?utm_medium=email&utm_ >>> source=link&utm_campaign=sig-email&utm_content=emailclient> >>>>>> >>>>>> Libre de virus. www.avast.com >>>>>> <https://www.avast.com/sig-email?utm_medium=email&utm_ >>> source=link&utm_campaign=sig-email&utm_content=emailclient> >>>>>> >>>>>> >>>>>> >>>>>> <#DAB4FAD8-2DD7-40BB-A1B8-4E2AA1F9FDF2> >>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> R-sig-phylo mailing list - R-sig-phylo@r-project.org >>>>>> https://stat.ethz.ch/mailman/listinfo/r-sig-phylo >>>>>> Searchable archive at >>>>>> http://www.mail-archive.com/r-sig-phylo@r-project.org/ >>>>>> >>>>> >>>>> _______________________________________________ >>>>> R-sig-phylo mailing list - R-sig-phylo@r-project.org >>>>> https://stat.ethz.ch/mailman/listinfo/r-sig-phylo >>>>> Searchable archive at >>>>> http://www.mail-archive.com/r-sig-phylo@r-project.org/ >>> >>> >>> [[alternative HTML version deleted]] >>> >>> _______________________________________________ >>> R-sig-phylo mailing list - R-sig-phylo@r-project.org >>> https://stat.ethz.ch/mailman/listinfo/r-sig-phylo >>> Searchable archive at http://www.mail-archive.com/r- >>> sig-ph...@r-project.org/ >>> [[alternative HTML version deleted]] _______________________________________________ R-sig-phylo mailing list - R-sig-phylo@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-phylo Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/