Hi Joseph,

just a few changes on your code and it is actually really fast. Having
multiple calls to which() is often slow.
Replacing which() with a simple lookup table will speed your code up quite
a bit as you can see below.
This is actually how the function Ancestors() in phangorn works.
But let's start with the output from profiling your code and you can see
that alt_mrca() spends most of the time in which().

Rprof(tmp <- tempfile())
nd.alt <- alt_mrca(phy, tips)
Rprof()
summaryRprof(tmp)
unlink(tmp)

$by.self
        self.time self.pct total.time total.pct
"which"      1.56      100       1.56       100

$by.total
           total.time total.pct self.time self.pct
"which"          1.56       100      1.56      100
"alt_mrca"       1.56       100      0.00        0

$sample.interval
[1] 0.02

$sampling.time
[1] 1.56

Now here are some timings in comparison this version with one where I
replaced calls to which() with a lookup table:

# ALT
system.time(nd.alt <- alt_mrca(phy, tips)); nd.alt;
   user  system elapsed
  1.552   0.004   1.555
[1] 100003
>
# ALT FAST
system.time(nd.alt <- alt_mrca_fast(phy, tips)); nd.alt;
   user  system elapsed
  0.004   0.000   0.004
[1] 100003

Not too bad for some pure R code.
And now you really need to start thinking of another excuse for a coffee
break.

Cheers,
Klaus

PS: Emmanuel, just another function which could get into next version.
PPS: The plot should better have a logarithmic scale for time.
PPPS: Here is the improved code, just 2 additional lines.


alt_mrca_fast <- function (phy, tips) {
    rootnd <- length(phy$tip.label) + 1;
    pars <- numeric(); # parents in lineage of first queried tip
    mrcaind <- 1;      # index in pars of the mrca node
    tnd <- NULL;       # node ids of tips
    if ("character" %in% class(tips)) {
        tnd <- match(tips, phy$tip.label);
    } else {
        tnd <- tips;
    }

    # build a lookup table to get parents faster
    pvec <- integer(max(phy$edge))
    pvec[phy$edge[,2]] = phy$edge[,1]

    # get entire lineage for first tip
    nd <- tnd[1];
    repeat {
#        nd <- phy$edge[which(phy$edge[,2] == nd),1]; # get immediate parent
        nd <- pvec[nd]
        pars <- c(pars, nd);
        if (nd == rootnd) {
            break;
        }
    }


    # traverse lineages for remaining tips, stop if hit common ancestor
    for (i in 2:length(tnd)) {
        done <- FALSE;
        cnd <- tnd[i];
        while (!done) {
#            cpar <- phy$edge[which(phy$edge[,2] == cnd),1]; # get
immediate parent
            cpar <- pvec[cnd]
            if (cpar %in% pars) {
                if (cpar == rootnd) {
                    # early exit! if the mrca of any set of taxa is the
root then we are done
                    return(rootnd);
                }
                idx <- which(pars == cpar);
                if (idx > mrcaind) {
                    mrcaind <- idx;
                }
                done <- TRUE;
            } else {
                # keep going!
                cnd <- cpar;
            }
        }
    }
    return(pars[mrcaind]);
}




On Wed, Jun 7, 2017 at 4:24 PM, Joseph W. Brown <josep...@umich.edu> wrote:

> I've been working with some very large trees (tens to hundreds of
> thousands of tips) where the speed of APE's getMRCA function has been
> prohibitively slow. I therefore R-ified an MRCA function I developed in
> Java for use with the Open Tree of Life tree (~3 million tips). I don't
> pretend to be an algorithm guy, but this seems to work.
>
> Given a tree `phy` and a set of query taxa `tips`, here is the algorithm
> (note that the terminology has the root at the top of the tree):
> 1) For the first tip, drill up through the tree, recording its entire
> lineage (parental nodes) to the root in a vector. This seems wasteful
> (i.e., if the MRCA is *not* the root) but in practice is actually fast.
> Call this vector `pars`.
>

Even computing all ancestors is actually not so slow:
system.time(tmp <- Ancestors(phy, 1:100000))
   user  system elapsed
  0.084   0.000   0.085
You need to traverse the whole tree only once, but you may run out of
memory for larger trees.


2) For each subsequent tip, traverse its lineage up in the tree until
> encountering an ancestor in `pars`.
> 3) Return the highest indexed node in `pars` that was traversed by the
> other lineages.
>
> A convenient "early exit" of this approach is that if *any* of the MRCAs
> are the root then we can stop immediately. Another benefit with this is
> that the query nodes do *not* necessarily have to be terminal taxa: MRCAs
> can be found among terminal and internal nodes.
>
> So, how does it fare? Pretty good! Here are some results on a 100K tip
> tree (the tree I am actually working with has 350K tips).
>
> # set up example
> set.seed(13);
> phy <- rtree(100000);
> node <- 100003; # make sure MRCA is not the root
> tt <- unlist(Descendants(phy, 100003));
> set.seed(13);
> tips <- sample(tt, 50);
>
> # APE
> system.time(nd.ape <- getMRCA(phy, tips)); nd.ape;
>    user  system elapsed
>  14.459   0.066  14.482
> [1] 100003
>
> # ALT
> system.time(nd.alt <- alt_mrca(phy, tips)); nd.alt;
>    user  system elapsed
>   2.557   0.534   3.063
> [1] 100003
>
> Ok, that was for a shallow MRCA node (i.e. close to the root). How does it
> fare for recent MRCAs? Here is a timing for sister tips:
>
> # APE
> system.time(nd.ape <- getMRCA(phy, c(13,14))); nd.ape;
>    user  system elapsed
>  14.715   0.063  14.727
> [1] 100027
>
> # ALT
> system.time(nd.alt <- alt_mrca(phy, c(13,14))); nd.alt;
>    user  system elapsed
>   0.047   0.015   0.061
> [1] 100027
>
> Way faster! Getting the entire lineage of the first tip does not seem to
> hurt us.
>
> We can also look how the timing scales as the size of the query set
> changes (I've made sure that the MRCA is identical in every case):
> (APE is blue, alt_mrca is red). So the APE version seems insensitive to
> the number of tips query. I suppose there must be a point where the timings
> cross and the alt_mrca function takes more time, but I have not found it.
> Besides, if the query set becomes large enough then the MRCA is likely the
> root itself, and in that case we may exit early:
>
> # APE with just 2 tips
> system.time(nd.ape <- getMRCA(phy, c(1,100000))); nd.ape;
>    user  system elapsed
>  14.627   0.055  14.642
> [1] 100001
>
> # ALT with a random 500 tips
> system.time(nd.alt <- alt_mrca(phy, sample(1:100000, 500))); nd.alt;
>    user  system elapsed
>   0.292   0.058   0.346
> [1] 100001
>
> Not bad. The code for the alternate MRCA is below. I've considered
> lapply-ing this (i.e. across query nodes), but that would negate the
> potential early exit (as *all* tip-to-root lineages would have to be
> traversed). Note that in all of these timings above the APE getMRCA
> function uses C code but the alternative is pure R. I imagine the alt_mrca
> could be made even speedier if coded in C.
>
> HTH.
> Joseph.
>
> alt_mrca <- function (phy, tips) {
>     rootnd <- length(phy$tip.label) + 1;
>     pars <- numeric(); # parents in lineage of first queried tip
>     mrcaind <- 1;      # index in pars of the mrca node
>     tnd <- NULL;       # node ids of tips
>     if ("character" %in% class(tips)) {
>         tnd <- match(tips, phy$tip.label);
>     } else {
>         tnd <- tips;
>     }
>
>
>     # get entire lineage for first tip
>     nd <- tnd[1];
>     repeat {
>         nd <- phy$edge[which(phy$edge[,2] == nd),1]; # get immediate
> parent
>         pars <- c(pars, nd);
>         if (nd == rootnd) {
>             break;
>         }
>     }
>
>
>     # traverse lineages for remaining tips, stop if hit common ancestor
>     for (i in 2:length(tnd)) {
>         done <- FALSE;
>         cnd <- tnd[i];
>         while (!done) {
>             cpar <- phy$edge[which(phy$edge[,2] == cnd),1]; # get
> immediate parent
>             if (cpar %in% pars) {
>                 if (cpar == rootnd) {
>                     # early exit! if the mrca of any set of taxa is the
> root then we are done
>                     return(rootnd);
>                 }
>                 idx <- which(pars == cpar);
>                 if (idx > mrcaind) {
>                     mrcaind <- idx;
>                 }
>                 done <- TRUE;
>             } else {
>                 # keep going!
>                 cnd <- cpar;
>             }
>         }
>     }
>     return(pars[mrcaind]);
> }
>
> ________________________________________
> Joseph W. Brown
> Post-doctoral Researcher, Smith Laboratory
> University of Michigan
> Department of Ecology & Evolutionary Biology
> Room 2071, Kraus Natural Sciences Building
> Ann Arbor MI 48109-1079
> josep...@umich.edu
>
>
>
>
> _______________________________________________
> R-sig-phylo mailing list - R-sig-phylo@r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
> Searchable archive at http://www.mail-archive.com/r-
> sig-ph...@r-project.org/
>



-- 
Klaus Schliep
Postdoctoral Fellow
Revell Lab, University of Massachusetts Boston
http://www.phangorn.org/
_______________________________________________
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/

Reply via email to