# [R-sig-phylo] A possible alternate MRCA function to APE's getMRCA

```I've been working with some very large trees (tens to hundreds of thousands of
tips) where the speed of APE's getMRCA function has been prohibitively slow. I
therefore R-ified an MRCA function I developed in Java for use with the Open
Tree of Life tree (~3 million tips). I don't pretend to be an algorithm guy,
but this seems to work.```
```
Given a tree `phy` and a set of query taxa `tips`, here is the algorithm (note
that the terminology has the root at the top of the tree):
1) For the first tip, drill up through the tree, recording its entire lineage
(parental nodes) to the root in a vector. This seems wasteful (i.e., if the
MRCA is not the root) but in practice is actually fast. Call this vector `pars`.
2) For each subsequent tip, traverse its lineage up in the tree until
encountering an ancestor in `pars`.
3) Return the highest indexed node in `pars` that was traversed by the other
lineages.

A convenient "early exit" of this approach is that if any of the MRCAs are the
root then we can stop immediately. Another benefit with this is that the query
nodes do not necessarily have to be terminal taxa: MRCAs can be found among
terminal and internal nodes.

So, how does it fare? Pretty good! Here are some results on a 100K tip tree
(the tree I am actually working with has 350K tips).

# set up example
set.seed(13);
phy <- rtree(100000);
node <- 100003; # make sure MRCA is not the root
tt <- unlist(Descendants(phy, 100003));
set.seed(13);
tips <- sample(tt, 50);

# APE
system.time(nd.ape <- getMRCA(phy, tips)); nd.ape;
user  system elapsed
14.459   0.066  14.482
[1] 100003

# ALT
system.time(nd.alt <- alt_mrca(phy, tips)); nd.alt;
user  system elapsed
2.557   0.534   3.063
[1] 100003

Ok, that was for a shallow MRCA node (i.e. close to the root). How does it fare
for recent MRCAs? Here is a timing for sister tips:

# APE
system.time(nd.ape <- getMRCA(phy, c(13,14))); nd.ape;
user  system elapsed
14.715   0.063  14.727
[1] 100027

# ALT
system.time(nd.alt <- alt_mrca(phy, c(13,14))); nd.alt;
user  system elapsed
0.047   0.015   0.061
[1] 100027

Way faster! Getting the entire lineage of the first tip does not seem to hurt
us.

We can also look how the timing scales as the size of the query set changes
(I've made sure that the MRCA is identical in every case):

(APE is blue, alt_mrca is red). So the APE version seems insensitive to the
number of tips query. I suppose there must be a point where the timings cross
and the alt_mrca function takes more time, but I have not found it. Besides, if
the query set becomes large enough then the MRCA is likely the root itself, and
in that case we may exit early:

# APE with just 2 tips
system.time(nd.ape <- getMRCA(phy, c(1,100000))); nd.ape;
user  system elapsed
14.627   0.055  14.642
[1] 100001

# ALT with a random 500 tips
system.time(nd.alt <- alt_mrca(phy, sample(1:100000, 500))); nd.alt;
user  system elapsed
0.292   0.058   0.346
[1] 100001

Not bad. The code for the alternate MRCA is below. I've considered lapply-ing
this (i.e. across query nodes), but that would negate the potential early exit
(as all tip-to-root lineages would have to be traversed). Note that in all of
these timings above the APE getMRCA function uses C code but the alternative is
pure R. I imagine the alt_mrca could be made even speedier if coded in C.

HTH.
Joseph.

alt_mrca <- function (phy, tips) {
rootnd <- length(phy\$tip.label) + 1;
pars <- numeric(); # parents in lineage of first queried tip
mrcaind <- 1;      # index in pars of the mrca node
tnd <- NULL;       # node ids of tips
if ("character" %in% class(tips)) {
tnd <- match(tips, phy\$tip.label);
} else {
tnd <- tips;
}

# get entire lineage for first tip
nd <- tnd[1];
repeat {
nd <- phy\$edge[which(phy\$edge[,2] == nd),1]; # get immediate parent
pars <- c(pars, nd);
if (nd == rootnd) {
break;
}
}

# traverse lineages for remaining tips, stop if hit common ancestor
for (i in 2:length(tnd)) {
done <- FALSE;
cnd <- tnd[i];
while (!done) {
cpar <- phy\$edge[which(phy\$edge[,2] == cnd),1]; # get immediate
parent
if (cpar %in% pars) {
if (cpar == rootnd) {
# early exit! if the mrca of any set of taxa is the root
then we are done
return(rootnd);
}
idx <- which(pars == cpar);
if (idx > mrcaind) {
mrcaind <- idx;
}
done <- TRUE;
} else {
# keep going!
cnd <- cpar;
}
}
}
return(pars[mrcaind]);
}

________________________________________
Joseph W. Brown
Post-doctoral Researcher, Smith Laboratory
University of Michigan
Department of Ecology & Evolutionary Biology
Room 2071, Kraus Natural Sciences Building
Ann Arbor MI 48109-1079
josep...@umich.edu

```
```_______________________________________________
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/```