Re: [R-sig-phylo] A possible alternate MRCA function to APE's getMRCA

2017-06-10 Thread Klaus Schliep
Hi Joseph & Co,
don't trust that shallowest node has lowest node id. That is a byproduct by
functions like rtree() or read.tree(), but must not be generally true. E.g.
after tree rearrangements.
There is nothing wrong with this tree:

library(ape)
tree <- structure(list(edge = structure(c(6L, 6L, 8L, 7L, 7L, 8L, 6L,
  1L, 8L, 7L, 2L, 3L, 4L, 5L), .Dim = c(7L, 2L)), tip.label = c("t1", "t2",
"t3", "t4", "t5"), Nnode = 3L), .Names = c("edge", "tip.label", "Nnode"),
class = "phylo", order = "cladewise")
plot(tree)
nodelabels()
getMRCA_new(tree, c(2:4))
[1] 7

So here is my last version:

getMRCA <- function(phy, tip) {
if (!inherits(phy, "phylo"))
stop('object "phy" is not of class "phylo"')
if (length(tip) < 2) return(NULL)
Ntip <- length(phy$tip.label)
rootnd <- Ntip + 1L

pars <- integer(phy$Nnode) # worst case assignment, usually far too long
tnd <- if (is.character(tip)) match(tip, phy$tip.label) else tip

done_v <- logical(max(phy$edge))

## build a lookup table to get parents faster
pvec <- integer(Ntip + phy$Nnode)
pvec[phy$edge[, 2]] <- phy$edge[, 1]

## get entire lineage for first tip
nd <- tnd[1]
for (k in 1:phy$Nnode) {
nd <- pvec[nd]
pars[k] <- nd
if (nd == rootnd) break
}
pars <- pars[1:k] # delete the rest
mrcind <- integer(max(pars))
mrcind[pars] <- 1:k

mrcand <- pars[1]

## traverse lineages for remaining tips, stop if hit common ancestor
for (i in 2:length(tnd)) {
cnd <- tnd[i]
done <- done_v[cnd]
while(!done){
done_v[cnd] <- TRUE
cpar <- pvec[cnd] # get immediate parent
done <- done_v[cpar] # early exit if TRUE
if (cpar %in% pars) {
if (cpar == rootnd) return(rootnd) # early exit
if(mrcind[cpar] > mrcind[mrcand]) mrcand <- cpar
done_v[cpar] <- TRUE
done <- TRUE
}
cnd <- cpar # keep going!
}
}
mrcand
}


It exists also early if an internal node was traversed before hand, so the
following code is a bit faster now.

library(ape)
tree <- stree(2^17, "balanced")
system.time(getMRCA(tree, 1:2^16))

Cheers,
Klaus




On Sat, Jun 10, 2017 at 1:01 PM, Emmanuel Paradis 
wrote:

> Joseph,
>
> Really cool. I updated the code in ape.
>
> About (byte-)compiling, this is usually done during installing the package:
>
> http://ape-package.ird.fr/ape_installation.html#linux
>
> I guess Win and Mac versions from CRAN are byte-compiled. To check if a
> function is compiled, just print it in R, it will show the bytecode at the
> end of the display:
>
> R> rtree
> 
> 
> 
>
> Besides, the NEWS for R 3.4.0 says:
>
> "The JIT (‘Just In Time’) byte-code compiler is now enabled by default at
> its level 3. This means functions will be compiled on first or second use
> and top-level loops will be compiled and then run. (Thanks to Tomas
> Kalibera for extensive work to make this possible.)"

This seems to be effective:
>
> R> foo <- function(x) x
> R> foo(1)
> [1] 1
> R> foo
> function(x) x
> R> foo(1)
> [1] 1
> R> foo
> function(x) x
> 
>
> Best,
>
> Emmanuel
>
>
> Le 10/06/2017 à 14:29, Joseph W. Brown a écrit :
>
>> Emmanuel and Klaus,
>>
>> After Klaus reminded me that `which` was so inefficient I realized using
>> it in finding the match in the first lineage is unnecessary. Instead, we
>> can capitalize on the fact that shallower nodes have smaller nodeids. So I
>> threw out the vector index `mrcaind` altogether:
>>
>> getMRCA_new<- function(phy, tip) {
>> if (!inherits(phy, "phylo"))
>> stop('object "phy" is not of class "phylo"')
>> if (length(tip) < 2) return(NULL)
>> Ntip <- length(phy$tip.label)
>> rootnd <- Ntip + 1L
>>
>> pars<- integer(phy$Nnode) # worst case assignment, usually far too long
>> tnd <- if (is.character(tip)) match(tip, phy$tip.label) else tip
>>
>> ## build a lookup table to get parents faster
>> pvec <- integer(Ntip + phy$Nnode)
>> pvec[phy$edge[, 2]] <- phy$edge[, 1]
>>
>> ## get entire lineage for first tip
>> nd <- tnd[1]
>> for (k in 1:phy$Nnode) {
>> nd <- pvec[nd]
>> pars[k] <- nd
>> if (nd == rootnd) break
>> }
>> pars<- pars[1:k] # delete the rest
>> mrcand <- pars[1]; # keep track of actual node rather than index in pars
>>
>>
>> ## traverse lineages for remaining tips, stop if hit common ancestor
>> for (i in 2:length(tnd)) {
>> done <- FALSE
>> cnd <- tnd[i]
>> while (!done) {
>> cpar<- pvec[cnd] # get immediate parent
>> if (cpar %in% pars) {
>> if (cpar == rootnd) return(rootnd) # early exit
>> mrcand <- min(mrcand,cpar); # shallowest node has lowest nodeid
>>
>> done <- TRUE
>> }
>> cnd <- cpar # keep going!
>> }
>> }
>> mrcand
>> }
>>
>> I also played around with compiling the function:
>>
>> library(compiler);
>> getMRCA_em_cmp<- cmpfun(getMRCA_em);
>> getMRCA_new_cmp<- cmpfun(getMRCA_new);
>>
>> Here are some timings (*_em 

Re: [R-sig-phylo] A possible alternate MRCA function to APE's getMRCA

2017-06-10 Thread Emmanuel Paradis

Joseph,

Really cool. I updated the code in ape.

About (byte-)compiling, this is usually done during installing the package:

http://ape-package.ird.fr/ape_installation.html#linux

I guess Win and Mac versions from CRAN are byte-compiled. To check if a 
function is compiled, just print it in R, it will show the bytecode at 
the end of the display:


R> rtree




Besides, the NEWS for R 3.4.0 says:

"The JIT (‘Just In Time’) byte-code compiler is now enabled by default 
at its level 3. This means functions will be compiled on first or second 
use and top-level loops will be compiled and then run. (Thanks to Tomas 
Kalibera for extensive work to make this possible.)"


This seems to be effective:

R> foo <- function(x) x
R> foo(1)
[1] 1
R> foo
function(x) x
R> foo(1)
[1] 1
R> foo
function(x) x


Best,

Emmanuel

Le 10/06/2017 à 14:29, Joseph W. Brown a écrit :

Emmanuel and Klaus,

After Klaus reminded me that `which` was so inefficient I realized using 
it in finding the match in the first lineage is unnecessary. Instead, we 
can capitalize on the fact that shallower nodes have smaller nodeids. So 
I threw out the vector index `mrcaind` altogether:


getMRCA_new<- function(phy, tip) {
if (!inherits(phy, "phylo"))
stop('object "phy" is not of class "phylo"')
if (length(tip) < 2) return(NULL)
Ntip <- length(phy$tip.label)
rootnd <- Ntip + 1L

pars<- integer(phy$Nnode) # worst case assignment, usually far too long
tnd <- if (is.character(tip)) match(tip, phy$tip.label) else tip

## build a lookup table to get parents faster
pvec <- integer(Ntip + phy$Nnode)
pvec[phy$edge[, 2]] <- phy$edge[, 1]

## get entire lineage for first tip
nd <- tnd[1]
for (k in 1:phy$Nnode) {
nd <- pvec[nd]
pars[k] <- nd
if (nd == rootnd) break
}
pars<- pars[1:k] # delete the rest
mrcand <- pars[1]; # keep track of actual node rather than index in pars


## traverse lineages for remaining tips, stop if hit common ancestor
for (i in 2:length(tnd)) {
done <- FALSE
cnd <- tnd[i]
while (!done) {
cpar<- pvec[cnd] # get immediate parent
if (cpar %in% pars) {
if (cpar == rootnd) return(rootnd) # early exit
mrcand <- min(mrcand,cpar); # shallowest node has lowest nodeid
done <- TRUE
}
cnd <- cpar # keep going!
}
}
mrcand
}

I also played around with compiling the function:

library(compiler);
getMRCA_em_cmp<- cmpfun(getMRCA_em);
getMRCA_new_cmp<- cmpfun(getMRCA_new);

Here are some timings (*_em is your code, *_new is the code above). For 
my original example (100K tips, 500 tips in MRCA query) there is not a 
whole lot of difference:


microbenchmark(getMRCA_em(phy, tips), getMRCA_new(phy, tips), 
getMRCA_em_cmp(phy, tips), getMRCA_new_cmp(phy, tips));

Unit: milliseconds
expr min lq mean median uq max neval
   getMRCA_em(phy, tips) 
23.9987929.8196433.6743732.3914935.61888136.34521100
  getMRCA_new(phy, tips) 
23.7694128.0250031.8428531.3700134.8964044.66380100
   getMRCA_em_cmp(phy, tips) 
20.8192423.6011828.2395728.2520230.1853559.72102100
  getMRCA_new_cmp(phy, tips) 
20.1658823.1957127.1961026.6578329.8214942.61702100


However, for the worst-case scenario that Klaus devised the results are 
very impressive:


microbenchmark(getMRCA_em(tree, c(1,2)), getMRCA_new(tree, c(1,2)), 
getMRCA_em_cmp(tree, c(1,2)), getMRCA_new_cmp(tree, c(1,2)));

Unit: milliseconds
expr min lq mean median uq max neval
   getMRCA_em(tree, c(1, 2)) 
124.79040140.44473146.98292145.39224150.96044286.78335100
  getMRCA_new(tree, c(1, 2)) 
123.39924139.41609145.21311143.92062148.34205297.73846100
   getMRCA_em_cmp(tree, c(1, 2)) 
23.7210526.8494429.9495029.8793532.2649041.99264100
  getMRCA_new_cmp(tree, c(1, 2)) 
23.1522325.7468629.2152629.2537030.8091569.68522100


So it might be worth compiling.

HTH.
JWB

Joseph W. Brown
Post-doctoral Researcher, Smith Laboratory
University of Michigan
Department of Ecology & Evolutionary Biology
Room 2071, Kraus Natural Sciences Building
Ann Arbor MI 48109-1079
josep...@umich.edu 



On 10 Jun, 2017, at 05:26, Emmanuel Paradis > wrote:


Joseph, Klaus,

This is great. I slightly edited the code and renamed the argument 
'tips' to 'tip' (as in the original definition). I run on a bunch of 
random trees and the results are exactly identical with the current 
version of getMRCA. I'm changing the code in ape now.


Cheers,

Emmanuel

getMRCA <- function(phy, tip)
{
   if (!inherits(phy, "phylo"))
   stop('object "phy" is not of class "phylo"')
   if (length(tip) < 2) return(NULL)
   Ntip <- length(phy$tip.label)
   rootnd <- Ntip + 1L

   pars <- integer(phy$Nnode) # worst case assignment, usually far too 
long
   mrcaind <- 1L # index in pars of the mrca node. will return highest 
one traversed by other lineages

   tnd <- if (is.character(tip)) match(tip, phy$tip.label) else tip

   ## build a lookup table to get parents faster
   pvec <- 

Re: [R-sig-phylo] A possible alternate MRCA function to APE's getMRCA

2017-06-10 Thread Joseph W. Brown
Emmanuel and Klaus,

After Klaus reminded me that `which` was so inefficient I realized using it in 
finding the match in the first lineage is unnecessary. Instead, we can 
capitalize on the fact that shallower nodes have smaller nodeids. So I threw 
out the vector index `mrcaind` altogether:

getMRCA_new <- function(phy, tip) {
   if (!inherits(phy, "phylo"))
   stop('object "phy" is not of class "phylo"')
   if (length(tip) < 2) return(NULL)
   Ntip <- length(phy$tip.label)
   rootnd <- Ntip + 1L

   pars <- integer(phy$Nnode) # worst case assignment, usually far too long
   tnd <- if (is.character(tip)) match(tip, phy$tip.label) else tip

   ## build a lookup table to get parents faster
   pvec <- integer(Ntip + phy$Nnode)
   pvec[phy$edge[, 2]] <- phy$edge[, 1]

   ## get entire lineage for first tip
   nd <- tnd[1]
   for (k in 1:phy$Nnode) {
   nd <- pvec[nd]
   pars[k] <- nd
   if (nd == rootnd) break
   }
   pars <- pars[1:k] # delete the rest
   mrcand <- pars[1]; # keep track of actual node rather than index in pars
   
   ## traverse lineages for remaining tips, stop if hit common ancestor
   for (i in 2:length(tnd)) {
   done <- FALSE
   cnd <- tnd[i]
   while (!done) {
   cpar <- pvec[cnd] # get immediate parent
   if (cpar %in% pars) {
   if (cpar == rootnd) return(rootnd) # early exit
   mrcand <- min(mrcand, cpar); # shallowest node has lowest nodeid
   done <- TRUE
   }
   cnd <- cpar # keep going!
   }
   }
   mrcand
}

I also played around with compiling the function:

library(compiler);
getMRCA_em_cmp <- cmpfun(getMRCA_em);
getMRCA_new_cmp <- cmpfun(getMRCA_new);

Here are some timings (*_em is your code, *_new is the code above). For my 
original example (100K tips, 500 tips in MRCA query) there is not a whole lot 
of difference:

microbenchmark(getMRCA_em(phy, tips), getMRCA_new(phy, tips), 
getMRCA_em_cmp(phy, tips), getMRCA_new_cmp(phy, tips));
Unit: milliseconds
   expr  min   lq mean   median   uq   
max neval
  getMRCA_em(phy, tips) 23.99879 29.81964 33.67437 32.39149 35.61888 
136.34521   100
 getMRCA_new(phy, tips) 23.76941 28.02500 31.84285 31.37001 34.89640  
44.66380   100
  getMRCA_em_cmp(phy, tips) 20.81924 23.60118 28.23957 28.25202 30.18535  
59.72102   100
 getMRCA_new_cmp(phy, tips) 20.16588 23.19571 27.19610 26.65783 29.82149  
42.61702   100

However, for the worst-case scenario that Klaus devised the results are very 
impressive:

microbenchmark(getMRCA_em(tree, c(1,2)), getMRCA_new(tree, c(1,2)), 
getMRCA_em_cmp(tree, c(1,2)), getMRCA_new_cmp(tree, c(1,2)));
Unit: milliseconds
   expr   minlq  meanmedian
uq   max neval
  getMRCA_em(tree, c(1, 2)) 124.79040 140.44473 146.98292 145.39224 
150.96044 286.78335   100
 getMRCA_new(tree, c(1, 2)) 123.39924 139.41609 145.21311 143.92062 
148.34205 297.73846   100
  getMRCA_em_cmp(tree, c(1, 2))  23.72105  26.84944  29.94950  29.87935  
32.26490  41.99264   100
 getMRCA_new_cmp(tree, c(1, 2))  23.15223  25.74686  29.21526  29.25370  
30.80915  69.68522   100

So it might be worth compiling.

HTH.
JWB

Joseph W. Brown
Post-doctoral Researcher, Smith Laboratory
University of Michigan
Department of Ecology & Evolutionary Biology
Room 2071, Kraus Natural Sciences Building
Ann Arbor MI 48109-1079
josep...@umich.edu



> On 10 Jun, 2017, at 05:26, Emmanuel Paradis  wrote:
> 
> Joseph, Klaus,
> 
> This is great. I slightly edited the code and renamed the argument 'tips' to 
> 'tip' (as in the original definition). I run on a bunch of random trees and 
> the results are exactly identical with the current version of getMRCA. I'm 
> changing the code in ape now.
> 
> Cheers,
> 
> Emmanuel
> 
> getMRCA <- function(phy, tip)
> {
>if (!inherits(phy, "phylo"))
>stop('object "phy" is not of class "phylo"')
>if (length(tip) < 2) return(NULL)
>Ntip <- length(phy$tip.label)
>rootnd <- Ntip + 1L
> 
>pars <- integer(phy$Nnode) # worst case assignment, usually far too long
>mrcaind <- 1L # index in pars of the mrca node. will return highest one 
> traversed by other lineages
>tnd <- if (is.character(tip)) match(tip, phy$tip.label) else tip
> 
>## build a lookup table to get parents faster
>pvec <- integer(Ntip + phy$Nnode)
>pvec[phy$edge[, 2]] <- phy$edge[, 1]
> 
>## get entire lineage for first tip
>nd <- tnd[1]
>for (k in 1:phy$Nnode) {
>nd <- pvec[nd]
>pars[k] <- nd
>if (nd == rootnd) break
>}
>pars <- pars[1:k] # delete the rest
> 
>## traverse lineages for remaining tips, stop if hit common ancestor
>for (i in 2:length(tnd)) {
>done <- FALSE
>cnd <- tnd[i]
>while (!done) {
>cpar <- pvec[cnd] # get immediate parent
>if 

Re: [R-sig-phylo] A possible alternate MRCA function to APE's getMRCA

2017-06-10 Thread Emmanuel Paradis

Joseph, Klaus,

This is great. I slightly edited the code and renamed the argument 
'tips' to 'tip' (as in the original definition). I run on a bunch of 
random trees and the results are exactly identical with the current 
version of getMRCA. I'm changing the code in ape now.


Cheers,

Emmanuel

getMRCA <- function(phy, tip)
{
if (!inherits(phy, "phylo"))
stop('object "phy" is not of class "phylo"')
if (length(tip) < 2) return(NULL)
Ntip <- length(phy$tip.label)
rootnd <- Ntip + 1L

pars <- integer(phy$Nnode) # worst case assignment, usually far too 
long
mrcaind <- 1L # index in pars of the mrca node. will return highest 
one traversed by other lineages

tnd <- if (is.character(tip)) match(tip, phy$tip.label) else tip

## build a lookup table to get parents faster
pvec <- integer(Ntip + phy$Nnode)
pvec[phy$edge[, 2]] <- phy$edge[, 1]

## get entire lineage for first tip
nd <- tnd[1]
for (k in 1:phy$Nnode) {
nd <- pvec[nd]
pars[k] <- nd
if (nd == rootnd) break
}
pars <- pars[1:k] # delete the rest

## traverse lineages for remaining tips, stop if hit common ancestor
for (i in 2:length(tnd)) {
done <- FALSE
cnd <- tnd[i]
while (!done) {
cpar <- pvec[cnd] # get immediate parent
if (cpar %in% pars) {
## early exit! if the mrca of any set of taxa is the 
root then we are done

if (cpar == rootnd) return(rootnd)
idx <- which(pars == cpar)
if (idx > mrcaind)  mrcaind <- idx
done <- TRUE
} else cnd <- cpar # keep going!
}
}
pars[mrcaind]
}


Le 10/06/2017 à 00:28, Klaus Schliep a écrit :

Hi Joseph

I guess I figured out where some room for improvement was.
You grow the vector pars <- c(pars, nd)
I preallocated pars, but you could even just grow pars using pars[k] <- nd
without pre-allocation. This operation is now (with R >= 3.4) far less
costly as mentioned in the NEWS:
"Assigning to an element of a vector beyond the current length now
over-allocates by a small fraction. The new vector is marked internally as
growable, and the true length of the new vector is stored in the truelength
field. This makes building up a vector result by assigning to the next
element beyond the current length more efficient, though pre-allocating is
still preferred. The implementation is subject to change and not intended
to be used in packages at this time."

One of the many little improvements recently making R quite a bit faster
and easier.

So the worst case behavior was pretty bad

tree <- stree(10, "right")
system.time(get_mrca_fast(tree, c(1,2)))
user  system elapsed
  13.424   0.012  13.440
  mrca
[1] 19

with the changes it is much better:

system.time(mrca <- get_mrca_fast(tree, c(1,2)))
user  system elapsed
   0.012   0.000   0.012

mrca

[1] 19

And here is the code:

get_mrca_fast <- function (phy, tips) {
 rootnd <- length(phy$tip.label) + 1;
 pars <- integer(phy$Nnode);  # worst case assignment, usually far too
long
 mrcaind <- 1;  # index in pars of the mrca node. will return
highest one traversed by other lineages
 tnd <- NULL;   # node ids of tips
 if ("character" %in% class(tips)) {
 tnd <- match(tips, phy$tip.label);
 } else {
 tnd <- tips;
 }

 # build a lookup table to get parents faster
 pvec <- integer(max(phy$edge));
 pvec[phy$edge[,2]] <- phy$edge[,1];

 # get entire lineage for first tip
 nd <- tnd[1];
 for(k in 1:phy$Nnode){
 nd <- pvec[nd]
 pars[k] <- nd
 if (nd == rootnd) break
 }
 pars <- pars[1:k] #delete the rest

 # traverse lineages for remaining tips, stop if hit common ancestor
 for (i in 2:length(tnd)) {
 done <- FALSE;
 cnd <- tnd[i];
 while (!done) {
 cpar <- pvec[cnd]; # get immediate parent
 if (cpar %in% pars) {
 if (cpar == rootnd) {
 # early exit! if the mrca of any set of taxa is the
root then we are done
 return(rootnd);
 }
 idx <- which(pars == cpar);
 if (idx > mrcaind) {
 mrcaind <- idx;
 }
 done <- TRUE;
 } else {
 # keep going!
 cnd <- cpar;
 }
 }
 }
 return(pars[mrcaind]);
}


Have a nice weekend,
Klaus

On Fri, Jun 9, 2017 at 3:59 PM, Joseph W. Brown  wrote:


Liam,

I put the (updated) code up as a gist . Feel free to use/improve/whatever.
Emmanuel sees room for improvement but frankly taking 0.3 seconds to find
the MRCA of 5000 taxa on a million-tip tree is good enough for my purposes.

JWB


Re: [R-sig-phylo] A possible alternate MRCA function to APE's getMRCA

2017-06-07 Thread Joseph W. Brown
w00t!

I am aware that `which` is inefficient, but wasn't sure how to get around it in 
this instance. Thanks!

It would be great to have this in the new version!

Joseph.

Joseph W. Brown
Post-doctoral Researcher, Smith Laboratory
University of Michigan
Department of Ecology & Evolutionary Biology
Room 2071, Kraus Natural Sciences Building
Ann Arbor MI 48109-1079
josep...@umich.edu



> On 7 Jun, 2017, at 19:06, Klaus Schliep  wrote:
> 
> Hi Joseph, 
> 
> just a few changes on your code and it is actually really fast. Having 
> multiple calls to which() is often slow.  
> Replacing which() with a simple lookup table will speed your code up quite a 
> bit as you can see below. 
> This is actually how the function Ancestors() in phangorn works. 
> But let's start with the output from profiling your code and you can see that 
> alt_mrca() spends most of the time in which(). 
> 
> Rprof(tmp <- tempfile())
> nd.alt <- alt_mrca(phy, tips)
> Rprof()
> summaryRprof(tmp)
> unlink(tmp)
> 
> $by.self
> self.time self.pct total.time total.pct
> "which"  1.56  100   1.56   100
> 
> $by.total
>total.time total.pct self.time self.pct
> "which"  1.56   100  1.56  100
> "alt_mrca"   1.56   100  0.000
> 
> $sample.interval
> [1] 0.02
> 
> $sampling.time
> [1] 1.56
> 
> Now here are some timings in comparison this version with one where I 
> replaced calls to which() with a lookup table:
> 
> # ALT
> system.time(nd.alt <- alt_mrca(phy, tips)); nd.alt;
>user  system elapsed 
>   1.552   0.004   1.555 
> [1] 13
> > 
> # ALT FAST
> system.time(nd.alt <- alt_mrca_fast(phy, tips)); nd.alt;
>user  system elapsed 
>   0.004   0.000   0.004 
> [1] 13 
> 
> Not too bad for some pure R code. 
> And now you really need to start thinking of another excuse for a coffee 
> break. 
> 
> Cheers, 
> Klaus
> 
> PS: Emmanuel, just another function which could get into next version. 
> PPS: The plot should better have a logarithmic scale for time.
> PPPS: Here is the improved code, just 2 additional lines.
> 
> 
> alt_mrca_fast <- function (phy, tips) {
> rootnd <- length(phy$tip.label) + 1;
> pars <- numeric(); # parents in lineage of first queried tip
> mrcaind <- 1;  # index in pars of the mrca node
> tnd <- NULL;   # node ids of tips
> if ("character" %in% class(tips)) {
> tnd <- match(tips, phy$tip.label);
> } else {
> tnd <- tips;
> }
> 
> # build a lookup table to get parents faster
> pvec <- integer(max(phy$edge))
> pvec[phy$edge[,2]] = phy$edge[,1]
> 
> # get entire lineage for first tip
> nd <- tnd[1];
> repeat {
> #nd <- phy$edge[which(phy$edge[,2] == nd),1]; # get immediate parent
> nd <- pvec[nd]
> pars <- c(pars, nd);
> if (nd == rootnd) {
> break;
> }
> }
> 
> 
> # traverse lineages for remaining tips, stop if hit common ancestor
> for (i in 2:length(tnd)) {
> done <- FALSE;
> cnd <- tnd[i];
> while (!done) {
> #cpar <- phy$edge[which(phy$edge[,2] == cnd),1]; # get immediate 
> parent
> cpar <- pvec[cnd]
> if (cpar %in% pars) {
> if (cpar == rootnd) {
> # early exit! if the mrca of any set of taxa is the root 
> then we are done
> return(rootnd);
> }
> idx <- which(pars == cpar);
> if (idx > mrcaind) {
> mrcaind <- idx;
> }
> done <- TRUE;
> } else {
> # keep going!
> cnd <- cpar;
> }
> }
> }
> return(pars[mrcaind]);
> }
> 
> 
> 
> 
> On Wed, Jun 7, 2017 at 4:24 PM, Joseph W. Brown  > wrote:
> I've been working with some very large trees (tens to hundreds of thousands 
> of tips) where the speed of APE's getMRCA function has been prohibitively 
> slow. I therefore R-ified an MRCA function I developed in Java for use with 
> the Open Tree of Life tree (~3 million tips). I don't pretend to be an 
> algorithm guy, but this seems to work.
> 
> Given a tree `phy` and a set of query taxa `tips`, here is the algorithm 
> (note that the terminology has the root at the top of the tree):
> 1) For the first tip, drill up through the tree, recording its entire lineage 
> (parental nodes) to the root in a vector. This seems wasteful (i.e., if the 
> MRCA is not the root) but in practice is actually fast. Call this vector 
> `pars`.
> 
> Even computing all ancestors is actually not so slow:
> system.time(tmp <- Ancestors(phy, 1:10))
>user  system elapsed 
>   0.084   0.000   0.085 
> You need to traverse the whole tree only once, but you may run out of memory 
> for larger trees.
> 
> 
> 2) For 

Re: [R-sig-phylo] A possible alternate MRCA function to APE's getMRCA

2017-06-07 Thread Klaus Schliep
Hi Joseph,

just a few changes on your code and it is actually really fast. Having
multiple calls to which() is often slow.
Replacing which() with a simple lookup table will speed your code up quite
a bit as you can see below.
This is actually how the function Ancestors() in phangorn works.
But let's start with the output from profiling your code and you can see
that alt_mrca() spends most of the time in which().

Rprof(tmp <- tempfile())
nd.alt <- alt_mrca(phy, tips)
Rprof()
summaryRprof(tmp)
unlink(tmp)

$by.self
self.time self.pct total.time total.pct
"which"  1.56  100   1.56   100

$by.total
   total.time total.pct self.time self.pct
"which"  1.56   100  1.56  100
"alt_mrca"   1.56   100  0.000

$sample.interval
[1] 0.02

$sampling.time
[1] 1.56

Now here are some timings in comparison this version with one where I
replaced calls to which() with a lookup table:

# ALT
system.time(nd.alt <- alt_mrca(phy, tips)); nd.alt;
   user  system elapsed
  1.552   0.004   1.555
[1] 13
>
# ALT FAST
system.time(nd.alt <- alt_mrca_fast(phy, tips)); nd.alt;
   user  system elapsed
  0.004   0.000   0.004
[1] 13

Not too bad for some pure R code.
And now you really need to start thinking of another excuse for a coffee
break.

Cheers,
Klaus

PS: Emmanuel, just another function which could get into next version.
PPS: The plot should better have a logarithmic scale for time.
PPPS: Here is the improved code, just 2 additional lines.


alt_mrca_fast <- function (phy, tips) {
rootnd <- length(phy$tip.label) + 1;
pars <- numeric(); # parents in lineage of first queried tip
mrcaind <- 1;  # index in pars of the mrca node
tnd <- NULL;   # node ids of tips
if ("character" %in% class(tips)) {
tnd <- match(tips, phy$tip.label);
} else {
tnd <- tips;
}

# build a lookup table to get parents faster
pvec <- integer(max(phy$edge))
pvec[phy$edge[,2]] = phy$edge[,1]

# get entire lineage for first tip
nd <- tnd[1];
repeat {
#nd <- phy$edge[which(phy$edge[,2] == nd),1]; # get immediate parent
nd <- pvec[nd]
pars <- c(pars, nd);
if (nd == rootnd) {
break;
}
}


# traverse lineages for remaining tips, stop if hit common ancestor
for (i in 2:length(tnd)) {
done <- FALSE;
cnd <- tnd[i];
while (!done) {
#cpar <- phy$edge[which(phy$edge[,2] == cnd),1]; # get
immediate parent
cpar <- pvec[cnd]
if (cpar %in% pars) {
if (cpar == rootnd) {
# early exit! if the mrca of any set of taxa is the
root then we are done
return(rootnd);
}
idx <- which(pars == cpar);
if (idx > mrcaind) {
mrcaind <- idx;
}
done <- TRUE;
} else {
# keep going!
cnd <- cpar;
}
}
}
return(pars[mrcaind]);
}




On Wed, Jun 7, 2017 at 4:24 PM, Joseph W. Brown  wrote:

> I've been working with some very large trees (tens to hundreds of
> thousands of tips) where the speed of APE's getMRCA function has been
> prohibitively slow. I therefore R-ified an MRCA function I developed in
> Java for use with the Open Tree of Life tree (~3 million tips). I don't
> pretend to be an algorithm guy, but this seems to work.
>
> Given a tree `phy` and a set of query taxa `tips`, here is the algorithm
> (note that the terminology has the root at the top of the tree):
> 1) For the first tip, drill up through the tree, recording its entire
> lineage (parental nodes) to the root in a vector. This seems wasteful
> (i.e., if the MRCA is *not* the root) but in practice is actually fast.
> Call this vector `pars`.
>

Even computing all ancestors is actually not so slow:
system.time(tmp <- Ancestors(phy, 1:10))
   user  system elapsed
  0.084   0.000   0.085
You need to traverse the whole tree only once, but you may run out of
memory for larger trees.


2) For each subsequent tip, traverse its lineage up in the tree until
> encountering an ancestor in `pars`.
> 3) Return the highest indexed node in `pars` that was traversed by the
> other lineages.
>
> A convenient "early exit" of this approach is that if *any* of the MRCAs
> are the root then we can stop immediately. Another benefit with this is
> that the query nodes do *not* necessarily have to be terminal taxa: MRCAs
> can be found among terminal and internal nodes.
>
> So, how does it fare? Pretty good! Here are some results on a 100K tip
> tree (the tree I am actually working with has 350K tips).
>
> # set up example
> set.seed(13);
> phy <- rtree(10);
> node <- 13; # make sure MRCA is not the root
> tt <- unlist(Descendants(phy, 13));
> set.seed(13);
> tips <- sample(tt, 50);
>
> # APE
> system.time(nd.ape <-