I've been working with some very large trees (tens to hundreds of thousands of 
tips) where the speed of APE's getMRCA function has been prohibitively slow. I 
therefore R-ified an MRCA function I developed in Java for use with the Open 
Tree of Life tree (~3 million tips). I don't pretend to be an algorithm guy, 
but this seems to work.

Given a tree `phy` and a set of query taxa `tips`, here is the algorithm (note 
that the terminology has the root at the top of the tree):
1) For the first tip, drill up through the tree, recording its entire lineage 
(parental nodes) to the root in a vector. This seems wasteful (i.e., if the 
MRCA is not the root) but in practice is actually fast. Call this vector `pars`.
2) For each subsequent tip, traverse its lineage up in the tree until 
encountering an ancestor in `pars`. 
3) Return the highest indexed node in `pars` that was traversed by the other 
lineages.

A convenient "early exit" of this approach is that if any of the MRCAs are the 
root then we can stop immediately. Another benefit with this is that the query 
nodes do not necessarily have to be terminal taxa: MRCAs can be found among 
terminal and internal nodes.

So, how does it fare? Pretty good! Here are some results on a 100K tip tree 
(the tree I am actually working with has 350K tips).

# set up example
set.seed(13);
phy <- rtree(100000);
node <- 100003; # make sure MRCA is not the root
tt <- unlist(Descendants(phy, 100003));
set.seed(13);
tips <- sample(tt, 50);

# APE
system.time(nd.ape <- getMRCA(phy, tips)); nd.ape;
   user  system elapsed 
 14.459   0.066  14.482 
[1] 100003

# ALT
system.time(nd.alt <- alt_mrca(phy, tips)); nd.alt;
   user  system elapsed 
  2.557   0.534   3.063 
[1] 100003

Ok, that was for a shallow MRCA node (i.e. close to the root). How does it fare 
for recent MRCAs? Here is a timing for sister tips:

# APE
system.time(nd.ape <- getMRCA(phy, c(13,14))); nd.ape;
   user  system elapsed 
 14.715   0.063  14.727 
[1] 100027

# ALT
system.time(nd.alt <- alt_mrca(phy, c(13,14))); nd.alt;
   user  system elapsed 
  0.047   0.015   0.061
[1] 100027

Way faster! Getting the entire lineage of the first tip does not seem to hurt 
us.

We can also look how the timing scales as the size of the query set changes 
(I've made sure that the MRCA is identical in every case):

(APE is blue, alt_mrca is red). So the APE version seems insensitive to the 
number of tips query. I suppose there must be a point where the timings cross 
and the alt_mrca function takes more time, but I have not found it. Besides, if 
the query set becomes large enough then the MRCA is likely the root itself, and 
in that case we may exit early:

# APE with just 2 tips
system.time(nd.ape <- getMRCA(phy, c(1,100000))); nd.ape;
   user  system elapsed 
 14.627   0.055  14.642 
[1] 100001

# ALT with a random 500 tips
system.time(nd.alt <- alt_mrca(phy, sample(1:100000, 500))); nd.alt;
   user  system elapsed 
  0.292   0.058   0.346 
[1] 100001

Not bad. The code for the alternate MRCA is below. I've considered lapply-ing 
this (i.e. across query nodes), but that would negate the potential early exit 
(as all tip-to-root lineages would have to be traversed). Note that in all of 
these timings above the APE getMRCA function uses C code but the alternative is 
pure R. I imagine the alt_mrca could be made even speedier if coded in C.

HTH.
Joseph.

alt_mrca <- function (phy, tips) {
    rootnd <- length(phy$tip.label) + 1;
    pars <- numeric(); # parents in lineage of first queried tip
    mrcaind <- 1;      # index in pars of the mrca node
    tnd <- NULL;       # node ids of tips
    if ("character" %in% class(tips)) {
        tnd <- match(tips, phy$tip.label);
    } else {
        tnd <- tips;
    }
    
    # get entire lineage for first tip
    nd <- tnd[1];
    repeat {
        nd <- phy$edge[which(phy$edge[,2] == nd),1]; # get immediate parent
        pars <- c(pars, nd);
        if (nd == rootnd) {
            break;
        }
    }
    
    # traverse lineages for remaining tips, stop if hit common ancestor
    for (i in 2:length(tnd)) {
        done <- FALSE;
        cnd <- tnd[i];
        while (!done) {
            cpar <- phy$edge[which(phy$edge[,2] == cnd),1]; # get immediate 
parent
            if (cpar %in% pars) {
                if (cpar == rootnd) {
                    # early exit! if the mrca of any set of taxa is the root 
then we are done
                    return(rootnd);
                }
                idx <- which(pars == cpar);
                if (idx > mrcaind) {
                    mrcaind <- idx;
                }
                done <- TRUE;
            } else {
                # keep going!
                cnd <- cpar;
            }
        }
    }
    return(pars[mrcaind]);
}

________________________________________
Joseph W. Brown
Post-doctoral Researcher, Smith Laboratory
University of Michigan
Department of Ecology & Evolutionary Biology
Room 2071, Kraus Natural Sciences Building
Ann Arbor MI 48109-1079
josep...@umich.edu



_______________________________________________
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/

Reply via email to