[R-sig-phylo] The Speed of Different Methods For Obtaining the Descendant Tip Taxa per Node
Hello all, I'm currently trying to measure a parameter over a large number of large trees (1700 tips), and part of this calculation requires knowing the tip taxa descended from each node (I must compare the difference in tip values among taxa descended from a node). Because I must do this many times, I decided to compare the efficiency of several methods for doing this in various R libraries with system.time() (as Liam did recently with some BM simulation functions in some recent blog posts). As I feel that others may benefit from this comparison of methods, I am posting the results to this list. Geiger's node.leaves() is much faster than the alternatives, although at ~13s a run, it is still not a particularly speedy process. I didn't need the actual tip.labels, so I took node.leaves() and made it as lean as possible, to produce node.tips(), below. That cut the run time down to ~9 sec. node.tips-function (phy, node){ n - length(phy$tip.label) if (node = n){node}else{ l- numeric() d- phy$edge[which(phy$edge[,1]==node),2] for(j in d){if(j = n){l - c(l, j)}else{l-c(l, node.tips(phy,j))}} l}} If anyone knows of another alternative that might be faster, I would greatly appreciate any suggestions. -Dave Bapst, UChicago Geosci Using the modified node.leaves() function from geiger, node.tips() system.time(desc-sapply(edge_end,function(x) node.tips(res_tree,x))) user system elapsed 9.390.009.44 Using node.leaves() in geiger system.time(desc-sapply(edge_end,function(x) node.leaves(res_tree,x))) user system elapsed 13.290.02 13.34 Using Descendants() in phangorn system.time(desc-sapply(edge_end,function(x) Descendants(res_tree,x))) user system elapsed 75.600.10 76.83 Using listTips() in adephylo system.time(desc_list-c(as.list(1:Ntip(res_tree)),listTips(res_tree))) user system elapsed 75.272.93 87.28 Using descendants() in phylobase system.time(desc-sapply(edge_end,function(x) descendants(res_tree4,x,type=tips))) user system elapsed 149.560.67 155.15 Using nodeDecendants() in maticce (note that translating the tree into ouchtree format was prohibitively very lengthy) system.time(res_tree_ou-ape2ouch(res_tree)) user system elapsed 84.010.13 85.34 system.time(desc-sapply(edge_end,function(x) nodeDescendents(res_tree_ou,x))) user system elapsed 65.910.23 68.47 -- David Bapst Dept of Geophysical Sciences University of Chicago 5734 S. Ellis Chicago, IL 60637 http://home.uchicago.edu/~dwbapst/ ___ R-sig-phylo mailing list R-sig-phylo@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Re: [R-sig-phylo] Dealing with Bounded Trait Measures
Marguerite and others- I had missed that Scales et al paper. Thank for pointing it out. I have also had to deal with variables that were bounded by 0 and 1 previously, and the use of the logit is something I'll have to try when I go back to those analyses. Although I am interested in transformations of such traits for comparative analyses in general (measuring correlated trait evolution, model fitting of OU, measurements of rate, etc.), I am mainly interested in measurements of phylogenetic signal. Is there any reason to think the kludge (adding an arbitrarily small value to trait values) would bias a test for phylogenetic signal (i.e. do the trait show a pattern of inheritance)? I'm worried that measurements of signal might be prone to odd results with the kludge, because the data structure would still be quite different from BM expectation. Perhaps, the threshold model might be of a lot of use here; I'm going to have to look into that much more. Thanks for you advice, all! -Dave On Sun, Mar 6, 2011 at 12:50 AM, Marguerite Butler mbut...@hawaii.edu wrote: Hi David, Liam and everyone, Reflecting traits at boundaries or absorbing them is something that can be done, but I guess I'd like to encourage everyone to think carefully about the interpretation of such simulations. What are you trying to model and what does it mean at the end? Doing these boundary manipulations produces odd shaped-distributions of traits at the end of the simulated process. But probably more importantly, does this a good model for the biological process which might be occurring? If so, then great. But I would guess not with traits such as spine length. The zero issue is a hard problem. Is it a combination of a discrete trait (presence/absence) and a continuous trait (length)? Or perhaps a better model would be a threshold trait with underlying continuous variation but below some boundary it will not be expressed? Has anyone developed a model that can be used for such a scenario? In the absence of such a model, I've done things like what Enrico suggests, with a kludge solution like adding a very small value to all values to avoid the zero. This doesn't make much difference to the analysis, and avoids the singularity at log(0). If you have a trait that is a frequency, you can use the logit() function instead of log(), as we did in Scales et al 2009. Scales J.A., King A.A., and Butler M.A. (2009) Running for your life or running for your dinner: What drives fiber-type evolution in lizard locomotor muscles? Am. Nat.173:543-553. Marguerite On Mar 5, 2011, at 11:06 AM, Liam J. Revell wrote: I'm not sure that this is what Dave has in mind, but if anyone is interested in simulating bounded evolution in R, I just added it to my fastBM() function (code here: http://anolis.oeb.harvard.edu/~liam/R-phylogenetics/fastBM/v0.3/fastBM.R). In the process of evolving traits up the tree, I just bounce back any phenotypes that exceed the lower or upper boundary conditions specified by the user (by default they are -Inf and Inf). I think I did this properly. Feedback welcome though. - Liam -- Liam J. Revell University of Massachusetts Boston web: http://faculty.umb.edu/liam.revell/ (new) email: liam.rev...@umb.edu (new) blog: http://phytools.blogspot.com On 3/5/2011 12:55 PM, tgarl...@ucr.edu wrote: Hello David, Enrico, et al., I may have lost track of what Dave was originally trying to do, and I am not familiar with all of the options presently available in r for simulating continuously valued traits along a specified phylogenetic tree. However, I wanted to point out that MANY possibilities, including trends, the OU process, and actual limits to trait evolution implemented in several ways, are available in our original DOS program PDSIMUL.EXE that accompanies this paper: Garland, T., Jr., A. W. Dickerman, C. M. Janis, and J. A. Jones. 1993. Phylogenetic analysis of covariance by computer simulation. Systematic Biology 42:265-292. It has been used many times to look at trends, limits, etc., e.g., in these papers: Díaz-Uriarte, R., and T. Garland, Jr. 1996. Testing hypotheses of correlated evolution using phylogenetically independent contrasts: sensitivity to deviations from Brownian motion. Systematic Biology 45:27-47. Laurin, M. 2010. Assessment of the relative merits of a few methods to detect evolutionary trends. Syst. Biol. 59:689-704. Cheers, Ted Theodore Garland, Jr. Professor Department of Biology University of California, Riverside Riverside, CA 92521 Office Phone: (951) 827-3524 Lab Phone: (951) 827-5724 Home Phone: (951) 328-0820 Facsimile: (951) 827-4286 = Dept. office (not confidential) Email: tgarl...@ucr.edu Main Departmental page: http://www.biology.ucr.edu/people/faculty/Garland.html List of all Publications: http://www.biology.ucr.edu/people/faculty/Garland/GarlandPublications.html Garland and
Re: [R-sig-phylo] The Speed of Different Methods For Obtaining the Descendant Tip Taxa per Node
Hi Dave. It seems like one way to get these values faster would be to count the number of descendants as the tree is read in to R. This is possible because when the ) character is reached by the Newick parser, all descendant nodes (and tips) have already been created in memory. This was simple to add to my tree reading function read.newick() [code here: http://anolis.oeb.harvard.edu/~liam/R-phylogenetics/read.newick/v0.3/read.newick.R]. Now v0.3 of the function returns a modified phylo object with the vector $Ndesc containing the number of descendants from each internal and terminal node above the root (in the order of $edge[,2]). This is exactly the same vector as would be obtained by the following commands: desc-sapply(tree$edge[,2],function(x) node.leaves(tree,x)) Ndesc-sapply(desc,function(x) length(x)) Note that read.newick() only reads one tree at a time in phylip format (and without node labels) - but reading multiple trees or node labels, or reading from nexus format, would quite easy to add if you find that this is indeed faster. In addition, I believe that newick2phylog() creates this list in reading a Newick tree into memory as a phylog object. This is stored as phy$Aparam$nlea for phylog object phy. Unfortunately, it seems like newick2phylog() is very slow for large trees. - Liam -- Liam J. Revell University of Massachusetts Boston web: http://faculty.umb.edu/liam.revell/ email: liam.rev...@umb.edu blog: http://phytools.blogspot.com On 3/6/2011 12:47 PM, David Bapst wrote: Hello all, I'm currently trying to measure a parameter over a large number of large trees (1700 tips), and part of this calculation requires knowing the tip taxa descended from each node (I must compare the difference in tip values among taxa descended from a node). Because I must do this many times, I decided to compare the efficiency of several methods for doing this in various R libraries with system.time() (as Liam did recently with some BM simulation functions in some recent blog posts). As I feel that others may benefit from this comparison of methods, I am posting the results to this list. Geiger's node.leaves() is much faster than the alternatives, although at ~13s a run, it is still not a particularly speedy process. I didn't need the actual tip.labels, so I took node.leaves() and made it as lean as possible, to produce node.tips(), below. That cut the run time down to ~9 sec. node.tips-function (phy, node){ n- length(phy$tip.label) if (node= n){node}else{ l- numeric() d- phy$edge[which(phy$edge[,1]==node),2] for(j in d){if(j= n){l- c(l, j)}else{l-c(l, node.tips(phy,j))}} l}} If anyone knows of another alternative that might be faster, I would greatly appreciate any suggestions. -Dave Bapst, UChicago Geosci Using the modified node.leaves() function from geiger, node.tips() system.time(desc-sapply(edge_end,function(x) node.tips(res_tree,x))) user system elapsed 9.390.009.44 Using node.leaves() in geiger system.time(desc-sapply(edge_end,function(x) node.leaves(res_tree,x))) user system elapsed 13.290.02 13.34 Using Descendants() in phangorn system.time(desc-sapply(edge_end,function(x) Descendants(res_tree,x))) user system elapsed 75.600.10 76.83 Using listTips() in adephylo system.time(desc_list-c(as.list(1:Ntip(res_tree)),listTips(res_tree))) user system elapsed 75.272.93 87.28 Using descendants() in phylobase system.time(desc-sapply(edge_end,function(x) descendants(res_tree4,x,type=tips))) user system elapsed 149.560.67 155.15 Using nodeDecendants() in maticce (note that translating the tree into ouchtree format was prohibitively very lengthy) system.time(res_tree_ou-ape2ouch(res_tree)) user system elapsed 84.010.13 85.34 system.time(desc-sapply(edge_end,function(x) nodeDescendents(res_tree_ou,x))) user system elapsed 65.910.23 68.47 ___ R-sig-phylo mailing list R-sig-phylo@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Re: [R-sig-phylo] The Speed of Different Methods For Obtaining the Descendant Tip Taxa per Node
Klaus- Oh, that worked rather splendid! Thanks for letting me know. system.time(desc-Descendants(res_tree,edge_end)) user system elapsed 1.560.001.56 -Dave On Sun, Mar 6, 2011 at 3:22 PM, Klaus Schliep klaus.schl...@gmail.com wrote: Hi David, you can supply Descendents (from phangorn) with a vector instead of using sapply. I should have mentioned this in the help file. If your vector (edge_end) is long enough (I don't have a exact number here) this can be much faster than using sapply as some results are reused. Here are some timings: set.seed(123) res_tree = rtree(1700) edge_end = unique(res_tree$edge[,1]) system.time(desc1-sapply(edge_end,function(x) Descendants(res_tree,x))) user system elapsed 54.51 0.00 54.87 system.time(desc2-Descendants(res_tree,edge_end)) user system elapsed 0.75 0.00 0.75 system.time(desc3-sapply(edge_end,function(x) node.tips(res_tree,x))) user system elapsed 4.07 0.00 4.07 Cheers, Klaus On 3/6/11, David Bapst dwba...@uchicago.edu wrote: Hello all, I'm currently trying to measure a parameter over a large number of large trees (1700 tips), and part of this calculation requires knowing the tip taxa descended from each node (I must compare the difference in tip values among taxa descended from a node). Because I must do this many times, I decided to compare the efficiency of several methods for doing this in various R libraries with system.time() (as Liam did recently with some BM simulation functions in some recent blog posts). As I feel that others may benefit from this comparison of methods, I am posting the results to this list. Geiger's node.leaves() is much faster than the alternatives, although at ~13s a run, it is still not a particularly speedy process. I didn't need the actual tip.labels, so I took node.leaves() and made it as lean as possible, to produce node.tips(), below. That cut the run time down to ~9 sec. node.tips-function (phy, node){ n - length(phy$tip.label) if (node = n){node}else{ l- numeric() d- phy$edge[which(phy$edge[,1]==node),2] for(j in d){if(j = n){l - c(l, j)}else{l-c(l, node.tips(phy,j))}} l}} If anyone knows of another alternative that might be faster, I would greatly appreciate any suggestions. -Dave Bapst, UChicago Geosci Using the modified node.leaves() function from geiger, node.tips() system.time(desc-sapply(edge_end,function(x) node.tips(res_tree,x))) user system elapsed 9.39 0.00 9.44 Using node.leaves() in geiger system.time(desc-sapply(edge_end,function(x) node.leaves(res_tree,x))) user system elapsed 13.29 0.02 13.34 Using Descendants() in phangorn system.time(desc-sapply(edge_end,function(x) Descendants(res_tree,x))) user system elapsed 75.60 0.10 76.83 Using listTips() in adephylo system.time(desc_list-c(as.list(1:Ntip(res_tree)),listTips(res_tree))) user system elapsed 75.27 2.93 87.28 Using descendants() in phylobase system.time(desc-sapply(edge_end,function(x) descendants(res_tree4,x,type=tips))) user system elapsed 149.56 0.67 155.15 Using nodeDecendants() in maticce (note that translating the tree into ouchtree format was prohibitively very lengthy) system.time(res_tree_ou-ape2ouch(res_tree)) user system elapsed 84.01 0.13 85.34 system.time(desc-sapply(edge_end,function(x) nodeDescendents(res_tree_ou,x))) user system elapsed 65.91 0.23 68.47 -- David Bapst Dept of Geophysical Sciences University of Chicago 5734 S. Ellis Chicago, IL 60637 http://home.uchicago.edu/~dwbapst/ ___ R-sig-phylo mailing list R-sig-phylo@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-phylo -- Klaus Schliep Université Paris 6 (Pierre et Marie Curie) 9, Quai Saint-Bernard, 75005 Paris -- David Bapst Dept of Geophysical Sciences University of Chicago 5734 S. Ellis Chicago, IL 60637 http://home.uchicago.edu/~dwbapst/ ___ R-sig-phylo mailing list R-sig-phylo@r-project.org https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Re: [R-sig-phylo] Dealing with Bounded Trait Measures
Hi David, I am not sure what you mean. If you add a tiny value to all of your data, then all you've done is shifted the mean ever so slightly. You then log-transform to put the variables on a scale unbounded by 0, and then you measure phylogenetic signal. The BM process models evolutionary changes as random draws from a normal distribution. So this would be the log-transformed values here. Adding a tiny value to all data doesn't change things very much. Log transformation changes things a lot more. So you're modeling a BM process on the log-transformed values. But there is nothing illegal about that. You do have to think about it, however. This makes a lot of sense to me when I'm modeling body size evolution, because the model would imply that when you're small, evolutionary changes tend to be smaller than when you're big (evolutionary changes on larger animals would tend to have bigger jumps on the original scale). Ignoring any clade-specific biases, evolutionary changes in the body size of mice would tend to be a lot smaller than evolutionary changes in the size of horses, for example. We can imagine that some simple mechanistic reasons such as their subunits are smaller (they have smaller cells, smaller vertebrae, etc., so adding on or taking away would have bigger size effects). This seems biologically reasonable. I am not sure if it would apply to your spine size data, but maybe it does. Of course if your size variation is not that great, then it doesn't much matter (because log-trasformation will then not have much effect on compressing the scale of the values). Anyway, this may be a longer-winded comment than you were asking for. I can't think of any reason why adding tiny values would do anything to the evolutionary simulation. Marguerite On Mar 6, 2011, at 9:53 AM, David Bapst wrote: Marguerite and others- I had missed that Scales et al paper. Thank for pointing it out. I have also had to deal with variables that were bounded by 0 and 1 previously, and the use of the logit is something I'll have to try when I go back to those analyses. Although I am interested in transformations of such traits for comparative analyses in general (measuring correlated trait evolution, model fitting of OU, measurements of rate, etc.), I am mainly interested in measurements of phylogenetic signal. Is there any reason to think the kludge (adding an arbitrarily small value to trait values) would bias a test for phylogenetic signal (i.e. do the trait show a pattern of inheritance)? I'm worried that measurements of signal might be prone to odd results with the kludge, because the data structure would still be quite different from BM expectation. Perhaps, the threshold model might be of a lot of use here; I'm going to have to look into that much more. Thanks for you advice, all! -Dave On Sun, Mar 6, 2011 at 12:50 AM, Marguerite Butler mbut...@hawaii.edu wrote: Hi David, Liam and everyone, Reflecting traits at boundaries or absorbing them is something that can be done, but I guess I'd like to encourage everyone to think carefully about the interpretation of such simulations. What are you trying to model and what does it mean at the end? Doing these boundary manipulations produces odd shaped-distributions of traits at the end of the simulated process. But probably more importantly, does this a good model for the biological process which might be occurring? If so, then great. But I would guess not with traits such as spine length. The zero issue is a hard problem. Is it a combination of a discrete trait (presence/absence) and a continuous trait (length)? Or perhaps a better model would be a threshold trait with underlying continuous variation but below some boundary it will not be expressed? Has anyone developed a model that can be used for such a scenario? In the absence of such a model, I've done things like what Enrico suggests, with a kludge solution like adding a very small value to all values to avoid the zero. This doesn't make much difference to the analysis, and avoids the singularity at log(0). If you have a trait that is a frequency, you can use the logit() function instead of log(), as we did in Scales et al 2009. Scales J.A., King A.A., and Butler M.A. (2009) Running for your life or running for your dinner: What drives fiber-type evolution in lizard locomotor muscles? Am. Nat.173:543-553. Marguerite On Mar 5, 2011, at 11:06 AM, Liam J. Revell wrote: I'm not sure that this is what Dave has in mind, but if anyone is interested in simulating bounded evolution in R, I just added it to my fastBM() function (code here: http://anolis.oeb.harvard.edu/~liam/R-phylogenetics/fastBM/v0.3/fastBM.R). In the process of evolving traits up the tree, I just bounce back any phenotypes that exceed the lower or upper boundary conditions specified by the user (by default they are -Inf and Inf). I think I
Re: [R-sig-phylo] The Speed of Different Methods For Obtaining the Descendant Tip Taxa per Node
Hi David and others, prop.part() with a single tree does what you want: set.seed(123) res_tree - rtree(1700) system.time(desc2 - prop.part(res_tree)) user system elapsed 0.050 0.000 0.053 edge_end - unique(res_tree$edge[,1]) system.time(desc1 - Descendants(res_tree,edge_end)) user system elapsed 0.400 0.000 0.406 This gives the same answer than Descendants: for (i in seq_along(desc1)) + stopifnot(all(sort(desc1[[i]]) == sort(desc2[[i]]))) Klaus makes an excellent reminder about R programming: use vectorisation whenever possible (you'll almost always win to do it). Your examples are very instructive. One lesson is that it is quite easy to write R code that tends to be slow. Another one is that it is also easy to write different functions doing the same thing. I'm quite impressed to see that several packages have re-implemented something that has been in ape for some time. I wish developpers (and users as well) could use these tools from ape which have been developed specifically for this kind of objectives (fast, efficient, and reliable). Is it an issue of 'visibility' of these functions? This could be a project idea for the next GSoC. Cheers, Emmanuel PS: if you want to speed-up your computations and you have a multi-core processor on your machine (which are common on laptops nowadays), you could try the multicore package (phangorn already supports it): library(multicore) job - parallel(prop.part(res_tree)) system.time(out - collect(job)) user system elapsed 0.080 0.010 0.002 out - out[[1]] identical(out, desc2) [1] TRUE This might be even more interesting with lists of trees. David Bapst wrote on 07/03/2011 04:32: Klaus- Oh, that worked rather splendid! Thanks for letting me know. system.time(desc-Descendants(res_tree,edge_end)) user system elapsed 1.560.001.56 -Dave On Sun, Mar 6, 2011 at 3:22 PM, Klaus Schliep klaus.schl...@gmail.com wrote: Hi David, you can supply Descendents (from phangorn) with a vector instead of using sapply. I should have mentioned this in the help file. If your vector (edge_end) is long enough (I don't have a exact number here) this can be much faster than using sapply as some results are reused. Here are some timings: set.seed(123) res_tree = rtree(1700) edge_end = unique(res_tree$edge[,1]) system.time(desc1-sapply(edge_end,function(x) Descendants(res_tree,x))) user system elapsed 54.510.00 54.87 system.time(desc2-Descendants(res_tree,edge_end)) user system elapsed 0.750.000.75 system.time(desc3-sapply(edge_end,function(x) node.tips(res_tree,x))) user system elapsed 4.070.004.07 Cheers, Klaus On 3/6/11, David Bapst dwba...@uchicago.edu wrote: Hello all, I'm currently trying to measure a parameter over a large number of large trees (1700 tips), and part of this calculation requires knowing the tip taxa descended from each node (I must compare the difference in tip values among taxa descended from a node). Because I must do this many times, I decided to compare the efficiency of several methods for doing this in various R libraries with system.time() (as Liam did recently with some BM simulation functions in some recent blog posts). As I feel that others may benefit from this comparison of methods, I am posting the results to this list. Geiger's node.leaves() is much faster than the alternatives, although at ~13s a run, it is still not a particularly speedy process. I didn't need the actual tip.labels, so I took node.leaves() and made it as lean as possible, to produce node.tips(), below. That cut the run time down to ~9 sec. node.tips-function (phy, node){ n - length(phy$tip.label) if (node = n){node}else{ l- numeric() d- phy$edge[which(phy$edge[,1]==node),2] for(j in d){if(j = n){l - c(l, j)}else{l-c(l, node.tips(phy,j))}} l}} If anyone knows of another alternative that might be faster, I would greatly appreciate any suggestions. -Dave Bapst, UChicago Geosci Using the modified node.leaves() function from geiger, node.tips() system.time(desc-sapply(edge_end,function(x) node.tips(res_tree,x))) user system elapsed 9.390.009.44 Using node.leaves() in geiger system.time(desc-sapply(edge_end,function(x) node.leaves(res_tree,x))) user system elapsed 13.290.02 13.34 Using Descendants() in phangorn system.time(desc-sapply(edge_end,function(x) Descendants(res_tree,x))) user system elapsed 75.600.10 76.83 Using listTips() in adephylo system.time(desc_list-c(as.list(1:Ntip(res_tree)),listTips(res_tree))) user system elapsed 75.272.93 87.28 Using descendants() in phylobase system.time(desc-sapply(edge_end,function(x) descendants(res_tree4,x,type=tips))) user system elapsed 149.560.67 155.15 Using nodeDecendants() in maticce (note that translating the tree into ouchtree format was prohibitively very lengthy)