Re: [R-sig-phylo] Parallelization in ape::dist.topo

Klaus Schliep Tue, 07 Mar 2023 04:01:48 -0800

Dear Vojtěch,
nice work. Just a few random comments:
Parallelization is often not straightforward as it depends on the hardware
and the operating system. My preference is using the future package for
parallelization as it does some nice abstraction for the different R
packages, so you can try different things and no need to write different
versions of the code. chatGPT seems to have missed this.
Defaulting to "cores = detectCores()" is not a good idea. The CRAN
Repository Policy (https://cran.r-project.org/web/packages/policies.html)
states: "If running a package uses multiple threads/cores it must never use
more than two simultaneously: the check farm is a shared resource and will
typically be running many checks simultaneously." In short such a default
can cause serious problems on a cluster and I got properly ripleyed for
this years ago, but that's another story. So the default should be
something like min(2L, detectCores()). So with great power comes great
responsibility and users should read the man pages.
Kind regards,
Klaus



On Mon, Mar 6, 2023 at 2:43 PM Vojtěch Zeisek <vo...@trapa.cz> wrote:

> Hello dear colleagues,
> I use often ape::dist.topo (see here dist.topo.r), which is doing the
> calculations sequentially, which is very slow for large data sets.


What is large? Large number of trees or large trees?

I'm sorry,
> I haven't found any relevant Git repository or so, so I hope Emmanuel
> won't
> mind if I discuss it here.
> I discussed various options with ChatGPT and dist.topo.par1.r is the
> simplest
> solution, basically using mc.lapply instead of 2 for loops. Good study
> material for how to do it in general. Little enhancements are in
> dist.topo.par2.r, which should be slightly better in case some pair of
> comparisons would return NA or so, but from my tests there doesn't seem to
> be
> any difference.
>

I think there is room for improvement with preprocessing trees. If you have
e.g. bootstrap trees with short edges you might want to use di2multi()
first to avoid spurious differences. Filtering out duplicated trees could
also speed up things, if there are many. This will depend on the trees you
want to compare and the methods.

And finally there is dist.topo.par3.r which doesn't load parallel (and uses
> plain lapply) for cores==1, while parallel and doParallel for multiple
> cores.
> It also contains some checks and error handling. From my testing it works
> well. I'm not sure if tryCatch is really needed there.


I am not sure if it is necessary, but you don't want to have it in the
inner loop ;)

In any case,
> improvements welcomed. :-)
> So, what do You think? Is this usable improvement of ape::dist.topo?
> Sincerely,
> V.
>
> --
> Vojtěch Zeisek
> https://trapa.cz/en/
>
> Department of Botany, Faculty of Science
> Charles University, Prague, Czech Republic
> https://www.natur.cuni.cz/biology/botany/
> https://lab-allience.natur.cuni.cz/
>
> Institute of Botany, Czech Academy of Sciences
> Průhonice, Czech Republic
> https://www.ibot.cas.cz/en/
> Computing cluster
> https://sorbus.ibot.cas.cz/en/start
> _______________________________________________
> R-sig-phylo mailing list - R-sig-phylo@r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
> Searchable archive at
> http://www.mail-archive.com/r-sig-phylo@r-project.org/
>


-- 
Klaus Schliep

Senior Scientist
Institute of Molecular Biotechnology
TU Graz
https://www.imbt.tugraz.at

        [[alternative HTML version deleted]]

_______________________________________________
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/

Re: [R-sig-phylo] Parallelization in ape::dist.topo

Reply via email to