Re: [R-sig-phylo] A possible alternate MRCA function to APE's getMRCA

Joseph W. Brown Sat, 10 Jun 2017 05:29:47 -0700

Emmanuel and Klaus,

After Klaus reminded me that `which` was so inefficient I realized using it in 
finding the match in the first lineage is unnecessary. Instead, we can 
capitalize on the fact that shallower nodes have smaller nodeids. So I threw 
out the vector index `mrcaind` altogether:


getMRCA_new <- function(phy, tip) {
   if (!inherits(phy, "phylo"))
       stop('object "phy" is not of class "phylo"')
   if (length(tip) < 2) return(NULL)
   Ntip <- length(phy$tip.label)
   rootnd <- Ntip + 1L

   pars <- integer(phy$Nnode) # worst case assignment, usually far too long
   tnd <- if (is.character(tip)) match(tip, phy$tip.label) else tip

   ## build a lookup table to get parents faster
   pvec <- integer(Ntip + phy$Nnode)
   pvec[phy$edge[, 2]] <- phy$edge[, 1]

   ## get entire lineage for first tip
   nd <- tnd[1]
   for (k in 1:phy$Nnode) {
       nd <- pvec[nd]
       pars[k] <- nd
       if (nd == rootnd) break
   }
   pars <- pars[1:k] # delete the rest
   mrcand <- pars[1]; # keep track of actual node rather than index in pars
   
   ## traverse lineages for remaining tips, stop if hit common ancestor
   for (i in 2:length(tnd)) {
       done <- FALSE
       cnd <- tnd[i]
       while (!done) {
           cpar <- pvec[cnd] # get immediate parent
           if (cpar %in% pars) {
               if (cpar == rootnd) return(rootnd) # early exit
               mrcand <- min(mrcand, cpar); # shallowest node has lowest nodeid
               done <- TRUE
           }
           cnd <- cpar # keep going!
       }
   }
   mrcand
}

I also played around with compiling the function:

library(compiler);
getMRCA_em_cmp <- cmpfun(getMRCA_em);
getMRCA_new_cmp <- cmpfun(getMRCA_new);

Here are some timings (*_em is your code, *_new is the code above). For my 
original example (100K tips, 500 tips in MRCA query) there is not a whole lot 
of difference:

microbenchmark(getMRCA_em(phy, tips), getMRCA_new(phy, tips), 
getMRCA_em_cmp(phy, tips), getMRCA_new_cmp(phy, tips));
Unit: milliseconds
                       expr      min       lq     mean   median       uq       
max neval
      getMRCA_em(phy, tips) 23.99879 29.81964 33.67437 32.39149 35.61888 
136.34521   100
     getMRCA_new(phy, tips) 23.76941 28.02500 31.84285 31.37001 34.89640  
44.66380   100
  getMRCA_em_cmp(phy, tips) 20.81924 23.60118 28.23957 28.25202 30.18535  
59.72102   100
 getMRCA_new_cmp(phy, tips) 20.16588 23.19571 27.19610 26.65783 29.82149  
42.61702   100

However, for the worst-case scenario that Klaus devised the results are very 
impressive:

microbenchmark(getMRCA_em(tree, c(1,2)), getMRCA_new(tree, c(1,2)), 
getMRCA_em_cmp(tree, c(1,2)), getMRCA_new_cmp(tree, c(1,2)));
Unit: milliseconds
                           expr       min        lq      mean    median        
uq       max neval
      getMRCA_em(tree, c(1, 2)) 124.79040 140.44473 146.98292 145.39224 
150.96044 286.78335   100
     getMRCA_new(tree, c(1, 2)) 123.39924 139.41609 145.21311 143.92062 
148.34205 297.73846   100
  getMRCA_em_cmp(tree, c(1, 2))  23.72105  26.84944  29.94950  29.87935  
32.26490  41.99264   100
 getMRCA_new_cmp(tree, c(1, 2))  23.15223  25.74686  29.21526  29.25370  
30.80915  69.68522   100

So it might be worth compiling.

HTH.
JWB
________________________________________
Joseph W. Brown
Post-doctoral Researcher, Smith Laboratory
University of Michigan
Department of Ecology & Evolutionary Biology
Room 2071, Kraus Natural Sciences Building
Ann Arbor MI 48109-1079
josep...@umich.edu



> On 10 Jun, 2017, at 05:26, Emmanuel Paradis <emmanuel.para...@ird.fr> wrote:
> 
> Joseph, Klaus,
> 
> This is great. I slightly edited the code and renamed the argument 'tips' to 
> 'tip' (as in the original definition). I run on a bunch of random trees and 
> the results are exactly identical with the current version of getMRCA. I'm 
> changing the code in ape now.
> 
> Cheers,
> 
> Emmanuel
> 
> getMRCA <- function(phy, tip)
> {
>    if (!inherits(phy, "phylo"))
>        stop('object "phy" is not of class "phylo"')
>    if (length(tip) < 2) return(NULL)
>    Ntip <- length(phy$tip.label)
>    rootnd <- Ntip + 1L
> 
>    pars <- integer(phy$Nnode) # worst case assignment, usually far too long
>    mrcaind <- 1L # index in pars of the mrca node. will return highest one 
> traversed by other lineages
>    tnd <- if (is.character(tip)) match(tip, phy$tip.label) else tip
> 
>    ## build a lookup table to get parents faster
>    pvec <- integer(Ntip + phy$Nnode)
>    pvec[phy$edge[, 2]] <- phy$edge[, 1]
> 
>    ## get entire lineage for first tip
>    nd <- tnd[1]
>    for (k in 1:phy$Nnode) {
>        nd <- pvec[nd]
>        pars[k] <- nd
>        if (nd == rootnd) break
>    }
>    pars <- pars[1:k] # delete the rest
> 
>    ## traverse lineages for remaining tips, stop if hit common ancestor
>    for (i in 2:length(tnd)) {
>        done <- FALSE
>        cnd <- tnd[i]
>        while (!done) {
>            cpar <- pvec[cnd] # get immediate parent
>            if (cpar %in% pars) {
>                ## early exit! if the mrca of any set of taxa is the root then 
> we are done
>                if (cpar == rootnd) return(rootnd)
>                idx <- which(pars == cpar)
>                if (idx > mrcaind)  mrcaind <- idx
>                done <- TRUE
>            } else cnd <- cpar # keep going!
>        }
>    }
>    pars[mrcaind]
> }
> 
> 
> Le 10/06/2017 à 00:28, Klaus Schliep a écrit :
>> Hi Joseph
>> I guess I figured out where some room for improvement was.
>> You grow the vector pars <- c(pars, nd)
>> I preallocated pars, but you could even just grow pars using pars[k] <- nd
>> without pre-allocation. This operation is now (with R >= 3.4) far less
>> costly as mentioned in the NEWS:
>> "Assigning to an element of a vector beyond the current length now
>> over-allocates by a small fraction. The new vector is marked internally as
>> growable, and the true length of the new vector is stored in the truelength
>> field. This makes building up a vector result by assigning to the next
>> element beyond the current length more efficient, though pre-allocating is
>> still preferred. The implementation is subject to change and not intended
>> to be used in packages at this time."
>> One of the many little improvements recently making R quite a bit faster
>> and easier.
>> So the worst case behavior was pretty bad
>> tree <- stree(100000, "right")
>> system.time(get_mrca_fast(tree, c(1,2)))
>>    user  system elapsed
>>  13.424   0.012  13.440
>>  mrca
>> [1] 199999
>> with the changes it is much better:
>> system.time(mrca <- get_mrca_fast(tree, c(1,2)))
>>    user  system elapsed
>>   0.012   0.000   0.012
>>> mrca
>> [1] 199999
>> And here is the code:
>> get_mrca_fast <- function (phy, tips) {
>>     rootnd <- length(phy$tip.label) + 1;
>>     pars <- integer(phy$Nnode);  # worst case assignment, usually far too
>> long
>>     mrcaind <- 1;      # index in pars of the mrca node. will return
>> highest one traversed by other lineages
>>     tnd <- NULL;       # node ids of tips
>>     if ("character" %in% class(tips)) {
>>         tnd <- match(tips, phy$tip.label);
>>     } else {
>>         tnd <- tips;
>>     }
>>     # build a lookup table to get parents faster
>>     pvec <- integer(max(phy$edge));
>>     pvec[phy$edge[,2]] <- phy$edge[,1];
>>     # get entire lineage for first tip
>>     nd <- tnd[1];
>>     for(k in 1:phy$Nnode){
>>         nd <- pvec[nd]
>>         pars[k] <- nd
>>         if (nd == rootnd) break
>>     }
>>     pars <- pars[1:k] #delete the rest
>>     # traverse lineages for remaining tips, stop if hit common ancestor
>>     for (i in 2:length(tnd)) {
>>         done <- FALSE;
>>         cnd <- tnd[i];
>>         while (!done) {
>>             cpar <- pvec[cnd]; # get immediate parent
>>             if (cpar %in% pars) {
>>                 if (cpar == rootnd) {
>>                     # early exit! if the mrca of any set of taxa is the
>> root then we are done
>>                     return(rootnd);
>>                 }
>>                 idx <- which(pars == cpar);
>>                 if (idx > mrcaind) {
>>                     mrcaind <- idx;
>>                 }
>>                 done <- TRUE;
>>             } else {
>>                 # keep going!
>>                 cnd <- cpar;
>>             }
>>         }
>>     }
>>     return(pars[mrcaind]);
>> }
>> Have a nice weekend,
>> Klaus
>> On Fri, Jun 9, 2017 at 3:59 PM, Joseph W. Brown <josep...@umich.edu> wrote:
>>> Liam,
>>> 
>>> I put the (updated) code up as a gist <https://gist.github.com/josephwb/
>>> ad61fd29ed4fb06e712e23d67422c813>. Feel free to use/improve/whatever.
>>> Emmanuel sees room for improvement but frankly taking 0.3 seconds to find
>>> the MRCA of 5000 taxa on a million-tip tree is good enough for my purposes.
>>> 
>>> JWB
>>> ________________________________________
>>> Joseph W. Brown
>>> Post-doctoral Researcher, Smith Laboratory
>>> University of Michigan
>>> Department of Ecology & Evolutionary Biology
>>> Room 2071, Kraus Natural Sciences Building
>>> Ann Arbor MI 48109-1079
>>> josep...@umich.edu
>>> 
>>> 
>>> 
>>>> On 9 Jun, 2017, at 15:20, Liam J. Revell <liam.rev...@umb.edu> wrote:
>>>> 
>>>> On the other hand, phytools does have a function - the somewhat
>>> imprecisely named fastMRCA - which can find the MRCA of just a pair of
>>> species much faster than getMRCA (however still slower than or only about
>>> as fast as Joseph & Klaus's solutions).
>>>> 
>>>> Liam J. Revell, Associate Professor of Biology
>>>> University of Massachusetts Boston
>>>> web: http://faculty.umb.edu/liam.revell/
>>>> email: liam.rev...@umb.edu
>>>> blog: http://blog.phytools.org
>>>> 
>>>> On 6/9/2017 5:22 AM, Liam J. Revell wrote:
>>>>> 
>>>>> Juan. findMRCA was written before getMRCA existed, but the latter was
>>>>> faster so now it just calls getMRCA internally. All the best, Liam
>>>>> 
>>>>> Liam J. Revell, Associate Professor of Biology
>>>>> University of Massachusetts Boston
>>>>> web: http://faculty.umb.edu/liam.revell/
>>>>> email: liam.rev...@umb.edu
>>>>> blog: http://blog.phytools.org
>>>>> 
>>>>> On 6/9/2017 1:57 AM, Juan Antonio Balbuena wrote:
>>>>>> Package phytools includes a function, findMRCA, that is supposed to
>>> work
>>>>>> very efficiently with large trees. you may wish to compare it with your
>>>>>> function.
>>>>>> 
>>>>>> Cheers
>>>>>> 
>>>>>> Juan
>>>>>> 
>>>>>> --
>>>>>> 
>>>>>> Dr. Juan A. Balbuena
>>>>>> Cavanilles Institute of Biodiversity and Evolutionary Biology
>>>>>> University of Valencia
>>>>>> http://www.uv.es/~balbuena <http://www.uv.es/%7Ebalbuena>
>>>>>> P.O. Box 22085
>>>>>> http://www.uv.es/cophylpaco
>>>>>> <http://www.uv.es/cavanilles/zoomarin/index.htm>
>>>>>> 46071 Valencia, Spain
>>>>>> e-mail: j.a.balbu...@uv.es <mailto:j.a.balbu...@uv.es>    tel. +34 963
>>>>>> 543 658    fax +34 963 543 733
>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>> *NOTE!*For shipments by EXPRESS COURIER use the following street
>>> address:
>>>>>> C/ Catedrático José Beltrán 2, 46980 Paterna (Valencia), Spain.
>>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>> 
>>>>>> 
>>>>>> <https://www.avast.com/sig-email?utm_medium=email&utm_
>>> source=link&utm_campaign=sig-email&utm_content=emailclient>
>>>>>> 
>>>>>>      Libre de virus. www.avast.com
>>>>>> <https://www.avast.com/sig-email?utm_medium=email&utm_
>>> source=link&utm_campaign=sig-email&utm_content=emailclient>
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> <#DAB4FAD8-2DD7-40BB-A1B8-4E2AA1F9FDF2>
>>>>>> 
>>>>>> 
>>>>>> _______________________________________________
>>>>>> R-sig-phylo mailing list - R-sig-phylo@r-project.org
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
>>>>>> Searchable archive at
>>>>>> http://www.mail-archive.com/r-sig-phylo@r-project.org/
>>>>>> 
>>>>> 
>>>>> _______________________________________________
>>>>> R-sig-phylo mailing list - R-sig-phylo@r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
>>>>> Searchable archive at
>>>>> http://www.mail-archive.com/r-sig-phylo@r-project.org/
>>> 
>>> 
>>>         [[alternative HTML version deleted]]
>>> 
>>> _______________________________________________
>>> R-sig-phylo mailing list - R-sig-phylo@r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
>>> Searchable archive at http://www.mail-archive.com/r-
>>> sig-ph...@r-project.org/
>>> 


        [[alternative HTML version deleted]]

_______________________________________________
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/

Re: [R-sig-phylo] A possible alternate MRCA function to APE's getMRCA

Reply via email to