Re: [R-sig-phylo] A possible alternate MRCA function to APE's getMRCA

Emmanuel Paradis Sat, 10 Jun 2017 10:02:13 -0700

Joseph,

Really cool. I updated the code in ape.


About (byte-)compiling, this is usually done during installing the package:

http://ape-package.ird.fr/ape_installation.html#linux

I guess Win and Mac versions from CRAN are byte-compiled. To check if afunction is compiled, just print it in R, it will show the bytecode atthe end of the display:


R> rtree
....
<bytecode: 0x41a1548>
<environment: namespace:ape>

Besides, the NEWS for R 3.4.0 says:

"The JIT (‘Just In Time’) byte-code compiler is now enabled by defaultat its level 3. This means functions will be compiled on first or seconduse and top-level loops will be compiled and then run. (Thanks to TomasKalibera for extensive work to make this possible.)"


This seems to be effective:

R> foo <- function(x) x
R> foo(1)
[1] 1
R> foo
function(x) x
R> foo(1)
[1] 1
R> foo
function(x) x
<bytecode: 0x604bda8>

Best,

Emmanuel

Le 10/06/2017 à 14:29, Joseph W. Brown a écrit :

Emmanuel and Klaus,

After Klaus reminded me that `which` was so inefficient I realized usingit in finding the match in the first lineage is unnecessary. Instead, wecan capitalize on the fact that shallower nodes have smaller nodeids. SoI threw out the vector index `mrcaind` altogether:


getMRCA_new<- function(phy, tip) {
if (!inherits(phy, "phylo"))
        stop('object "phy" is not of class "phylo"')
if (length(tip) < 2) return(NULL)
Ntip <- length(phy$tip.label)
rootnd <- Ntip + 1L

pars<- integer(phy$Nnode) # worst case assignment, usually far too long
tnd <- if (is.character(tip)) match(tip, phy$tip.label) else tip

## build a lookup table to get parents faster
pvec <- integer(Ntip + phy$Nnode)
pvec[phy$edge[, 2]] <- phy$edge[, 1]

## get entire lineage for first tip
nd <- tnd[1]
for (k in 1:phy$Nnode) {
nd <- pvec[nd]
pars[k] <- nd
if (nd == rootnd) break
    }
pars<- pars[1:k] # delete the rest
mrcand <- pars[1]; # keep track of actual node rather than index in pars


## traverse lineages for remaining tips, stop if hit common ancestor
for (i in 2:length(tnd)) {
done <- FALSE
cnd <- tnd[i]
while (!done) {
cpar<- pvec[cnd] # get immediate parent
if (cpar %in% pars) {
if (cpar == rootnd) return(rootnd) # early exit
mrcand <- min(mrcand,cpar); # shallowest node has lowest nodeid
done <- TRUE
            }
cnd <- cpar # keep going!
        }
    }
mrcand
}

I also played around with compiling the function:

library(compiler);
getMRCA_em_cmp<- cmpfun(getMRCA_em);
getMRCA_new_cmp<- cmpfun(getMRCA_new);

Here are some timings (*_em is your code, *_new is the code above). Formy original example (100K tips, 500 tips in MRCA query) there is not awhole lot of difference:

microbenchmark(getMRCA_em(phy, tips), getMRCA_new(phy, tips),getMRCA_em_cmp(phy, tips), getMRCA_new_cmp(phy, tips));

Unit: milliseconds
expr min lq mean median uq max neval

getMRCA_em(phy, tips)23.9987929.8196433.6743732.3914935.61888136.34521100getMRCA_new(phy, tips)23.7694128.0250031.8428531.3700134.8964044.66380100getMRCA_em_cmp(phy, tips)20.8192423.6011828.2395728.2520230.1853559.72102100getMRCA_new_cmp(phy, tips)20.1658823.1957127.1961026.6578329.8214942.61702100

However, for the worst-case scenario that Klaus devised the results arevery impressive:

microbenchmark(getMRCA_em(tree, c(1,2)), getMRCA_new(tree, c(1,2)),getMRCA_em_cmp(tree, c(1,2)), getMRCA_new_cmp(tree, c(1,2)));

Unit: milliseconds
expr min lq mean median uq max neval

getMRCA_em(tree, c(1, 2))124.79040140.44473146.98292145.39224150.96044286.78335100getMRCA_new(tree, c(1, 2))123.39924139.41609145.21311143.92062148.34205297.73846100getMRCA_em_cmp(tree, c(1, 2))23.7210526.8494429.9495029.8793532.2649041.99264100getMRCA_new_cmp(tree, c(1, 2))23.1522325.7468629.2152629.2537030.8091569.68522100


So it might be worth compiling.

HTH.
JWB
________________________________________
Joseph W. Brown
Post-doctoral Researcher, Smith Laboratory
University of Michigan
Department of Ecology & Evolutionary Biology
Room 2071, Kraus Natural Sciences Building
Ann Arbor MI 48109-1079
josep...@umich.edu <mailto:josep...@umich.edu>

On 10 Jun, 2017, at 05:26, Emmanuel Paradis <emmanuel.para...@ird.fr<mailto:emmanuel.para...@ird.fr>> wrote:


Joseph, Klaus,

This is great. I slightly edited the code and renamed the argument'tips' to 'tip' (as in the original definition). I run on a bunch ofrandom trees and the results are exactly identical with the currentversion of getMRCA. I'm changing the code in ape now.


Cheers,

Emmanuel

getMRCA <- function(phy, tip)
{
   if (!inherits(phy, "phylo"))
       stop('object "phy" is not of class "phylo"')
   if (length(tip) < 2) return(NULL)
   Ntip <- length(phy$tip.label)
   rootnd <- Ntip + 1L

pars <- integer(phy$Nnode) # worst case assignment, usually far toolongmrcaind <- 1L # index in pars of the mrca node. will return highestone traversed by other lineages

   tnd <- if (is.character(tip)) match(tip, phy$tip.label) else tip

   ## build a lookup table to get parents faster
   pvec <- integer(Ntip + phy$Nnode)
   pvec[phy$edge[, 2]] <- phy$edge[, 1]

   ## get entire lineage for first tip
   nd <- tnd[1]
   for (k in 1:phy$Nnode) {
       nd <- pvec[nd]
       pars[k] <- nd
       if (nd == rootnd) break
   }
   pars <- pars[1:k] # delete the rest

   ## traverse lineages for remaining tips, stop if hit common ancestor
   for (i in 2:length(tnd)) {
       done <- FALSE
       cnd <- tnd[i]
       while (!done) {
           cpar <- pvec[cnd] # get immediate parent
           if (cpar %in% pars) {

## early exit! if the mrca of any set of taxa is theroot then we are done

               if (cpar == rootnd) return(rootnd)
               idx <- which(pars == cpar)
               if (idx > mrcaind)  mrcaind <- idx
               done <- TRUE
           } else cnd <- cpar # keep going!
       }
   }
   pars[mrcaind]
}


Le 10/06/2017 à 00:28, Klaus Schliep a écrit :

Hi Joseph
I guess I figured out where some room for improvement was.
You grow the vector pars <- c(pars, nd)

I preallocated pars, but you could even just grow pars using pars[k]<- nd

without pre-allocation. This operation is now (with R >= 3.4) far less
costly as mentioned in the NEWS:
"Assigning to an element of a vector beyond the current length now

over-allocates by a small fraction. The new vector is markedinternally asgrowable, and the true length of the new vector is stored in thetruelength

field. This makes building up a vector result by assigning to the next

element beyond the current length more efficient, thoughpre-allocating is

still preferred. The implementation is subject to change and not intended
to be used in packages at this time."
One of the many little improvements recently making R quite a bit faster
and easier.
So the worst case behavior was pretty bad
tree <- stree(100000, "right")
system.time(get_mrca_fast(tree, c(1,2)))
   user  system elapsed
 13.424   0.012  13.440
 mrca
[1] 199999
with the changes it is much better:
system.time(mrca <- get_mrca_fast(tree, c(1,2)))
   user  system elapsed
  0.012   0.000   0.012

mrca

[1] 199999
And here is the code:
get_mrca_fast <- function (phy, tips) {
    rootnd <- length(phy$tip.label) + 1;
    pars <- integer(phy$Nnode);  # worst case assignment, usually far too
long
    mrcaind <- 1;      # index in pars of the mrca node. will return
highest one traversed by other lineages
    tnd <- NULL;       # node ids of tips
    if ("character" %in% class(tips)) {
        tnd <- match(tips, phy$tip.label);
    } else {
        tnd <- tips;
    }
    # build a lookup table to get parents faster
    pvec <- integer(max(phy$edge));
    pvec[phy$edge[,2]] <- phy$edge[,1];
    # get entire lineage for first tip
    nd <- tnd[1];
    for(k in 1:phy$Nnode){
        nd <- pvec[nd]
        pars[k] <- nd
        if (nd == rootnd) break
    }
    pars <- pars[1:k] #delete the rest
    # traverse lineages for remaining tips, stop if hit common ancestor
    for (i in 2:length(tnd)) {
        done <- FALSE;
        cnd <- tnd[i];
        while (!done) {
            cpar <- pvec[cnd]; # get immediate parent
            if (cpar %in% pars) {
                if (cpar == rootnd) {
                    # early exit! if the mrca of any set of taxa is the
root then we are done
                    return(rootnd);
                }
                idx <- which(pars == cpar);
                if (idx > mrcaind) {
                    mrcaind <- idx;
                }
                done <- TRUE;
            } else {
                # keep going!
                cnd <- cpar;
            }
        }
    }
    return(pars[mrcaind]);
}
Have a nice weekend,
Klaus

On Fri, Jun 9, 2017 at 3:59 PM, Joseph W. Brown <josep...@umich.edu<mailto:josep...@umich.edu>> wrote:

Liam,

I put the (updated) code up as a gist <https://gist.github.com/josephwb/
ad61fd29ed4fb06e712e23d67422c813>. Feel free to use/improve/whatever.

Emmanuel sees room for improvement but frankly taking 0.3 seconds tofindthe MRCA of 5000 taxa on a million-tip tree is good enough for mypurposes.


JWB
________________________________________
Joseph W. Brown
Post-doctoral Researcher, Smith Laboratory
University of Michigan
Department of Ecology & Evolutionary Biology
Room 2071, Kraus Natural Sciences Building
Ann Arbor MI 48109-1079
josep...@umich.edu <mailto:josep...@umich.edu>

On 9 Jun, 2017, at 15:20, Liam J. Revell <liam.rev...@umb.edu> wrote:

On the other hand, phytools does have a function - the somewhat

imprecisely named fastMRCA - which can find the MRCA of just a pair of

species much faster than getMRCA (however still slower than or onlyabout

as fast as Joseph & Klaus's solutions).


Liam J. Revell, Associate Professor of Biology
University of Massachusetts Boston
web: http://faculty.umb.edu/liam.revell/
email: liam.rev...@umb.edu
blog: http://blog.phytools.org

On 6/9/2017 5:22 AM, Liam J. Revell wrote:


Juan. findMRCA was written before getMRCA existed, but the latter was
faster so now it just calls getMRCA internally. All the best, Liam

Liam J. Revell, Associate Professor of Biology
University of Massachusetts Boston
web: http://faculty.umb.edu/liam.revell/
email: liam.rev...@umb.edu
blog: http://blog.phytools.org

On 6/9/2017 1:57 AM, Juan Antonio Balbuena wrote:

Package phytools includes a function, findMRCA, that is supposed to

work

very efficiently with large trees. you may wish to compare itwith your

function.

Cheers

Juan

--

Dr. Juan A. Balbuena
Cavanilles Institute of Biodiversity and Evolutionary Biology
University of Valencia
http://www.uv.es/~balbuena <http://www.uv.es/%7Ebalbuena>
P.O. Box 22085
http://www.uv.es/cophylpaco
<http://www.uv.es/cavanilles/zoomarin/index.htm>
46071 Valencia, Spain

e-mail: j.a.balbu...@uv.es <mailto:j.a.balbu...@uv.es> tel.+34 963

543 658    fax +34 963 543 733
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
*NOTE!*For shipments by EXPRESS COURIER use the following street

address:

C/ Catedrático José Beltrán 2, 46980 Paterna (Valencia), Spain.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


<https://www.avast.com/sig-email?utm_medium=email&utm_

source=link&utm_campaign=sig-email&utm_content=emailclient>


     Libre de virus. www.avast.com
<https://www.avast.com/sig-email?utm_medium=email&utm_

source=link&utm_campaign=sig-email&utm_content=emailclient>




<#DAB4FAD8-2DD7-40BB-A1B8-4E2AA1F9FDF2>


_______________________________________________
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at
http://www.mail-archive.com/r-sig-phylo@r-project.org/


_______________________________________________
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at
http://www.mail-archive.com/r-sig-phylo@r-project.org/



        [[alternative HTML version deleted]]

_______________________________________________
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-
sig-ph...@r-project.org/


_______________________________________________
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/

Re: [R-sig-phylo] A possible alternate MRCA function to APE's getMRCA

Reply via email to