Re: [R-sig-phylo] Parallelization in ape::dist.topo

Vojtěch Zeisek Tue, 07 Mar 2023 04:49:23 -0800

Hello,
thank You for Your comments, dear Klaus.

Dne úterý 7. března 2023 13:01:07 CET, Klaus Schliep napsal(a):
> Dear Vojtěch,
> nice work. Just a few random comments:
> Parallelization is often not straightforward as it
> depends on the hardware and the operating system.


Yes. E.g. it typically relies on fork(), which is unavailable on Windows.

> My preference is using the future package for parallelization
> as it does some nice abstraction for the different R packages,
> so you can try different things and no need to write different
> versions of the code. chatGPT seems to have missed this.

I was looking at future.apply package, but it seemed unnecessarily complex to 
me. I'm bit biased as I hardly meet anyone with working parallelization 
(greetings to MPI) on Windows and on Linux "plan" mostly works well.

> Defaulting to "cores = detectCores()" is not a good idea. The CRAN
> Repository Policy (https://cran.r-project.org/web/packages/policies.html)
> states: "If running a package uses multiple threads/cores it must never
> use more than two simultaneously: the check farm is a shared resource
> and will typically be running many checks simultaneously." In short such
> a default can cause serious problems on a cluster and I got properly
> ripleyed for this years ago, but that's another story. So the default
> should be something like min(2L, detectCores()).

That's good point, thank You.

> So with great power comes great responsibility
> and users should read the man pages.
> Kind regards,
> Klaus
> 
> On Mon, Mar 6, 2023 at 2:43 PM Vojtěch Zeisek wrote:
> > Hello dear colleagues,
> > I use often ape::dist.topo (see here dist.topo.r), which is doing the
> > calculations sequentially, which is very slow for large data sets.
> 
> What is large? Large number of trees or large trees?

Let's say up to ~1000 trees with 100+ tips. Then dist.topo runs for hours.

> > I'm sorry, I haven't found any relevant Git repository or so, so
> > I hope Emmanuel won't mind if I discuss it here. I discussed
> > various options with ChatGPT and dist.topo.par1.r is the simplest
> > solution, basically using mc.lapply instead of 2 for loops. Good
> > study material for how to do it in general. Little enhancements
> > are in dist.topo.par2.r, which should be slightly better in case
> > some pair of comparisons would return NA or so, but from my tests
> > there doesn't seem to beany difference.
> 
> I think there is room for improvement with preprocessing trees. If
> you have e.g. bootstrap trees with short edges you might want to
> use di2multi() first to avoid spurious differences. Filtering out
> duplicated trees could also speed up things, if there are many.
> This will depend on the trees you want to compare and the methods.

Could be, but not every distance method works well with polytomies.

> > And finally there is dist.topo.par3.r which doesn't load parallel
> > (and uses plain lapply) for cores==1, while parallel and doParallel
> > for multiple cores. It also contains some checks and error handling.
> > From my testing it works well. I'm not sure if tryCatch is really
> > needed there.
> 
> I am not sure if it is necessary, but you
> don't want to have it in the inner loop ;)

It's not first time I see such construction, I already used it like that (also 
in different cases) and so far so good. :-)

> > In any case, improvements welcomed. :-)
> > So, what do You think? Is this usable improvement of ape::dist.topo?

Thank You,
V.

-- 
Vojtěch Zeisek
https://trapa.cz/en/

Department of Botany, Faculty of Science
Charles University, Prague, Czech Republic
https://www.natur.cuni.cz/biology/botany/
https://lab-allience.natur.cuni.cz/

Institute of Botany, Czech Academy of Sciences
Průhonice, Czech Republic
https://www.ibot.cas.cz/en/
Computing cluster
https://sorbus.ibot.cas.cz/en/start

signature.asc
Description: This is a digitally signed message part.

_______________________________________________
R-sig-phylo mailing list - R-sig-phylo@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-phylo
Searchable archive at http://www.mail-archive.com/r-sig-phylo@r-project.org/

Re: [R-sig-phylo] Parallelization in ape::dist.topo

Reply via email to