I recently observed poor scaling with mat/tests/ex13 on Fugaku. I was running this test as is (e.g., no threads and 4 MPI processes per node/chip, which seems to be the recommended configuration). I did not dig into this. A test on about 10% of the machine took about 45 minutes to run.

Mark
On Sat, Mar 6, 2021 at 9:49 PM Junchao Zhang <[email protected]> wrote:

> On Sat, Mar 6, 2021 at 12:27 PM Matthew Knepley <[email protected]> wrote:
>
>> On Fri, Mar 5, 2021 at 4:06 PM Alexei Colin <[email protected]> wrote:
>>
>>> To PETSc DMPlex users, Firedrake users, Dr. Knepley and Dr. Karpeev:
>>>
>>> Is it expected for the mesh distribution step to
>>>
>>> (A) take a share of 50-99% of the total time-to-solution of an FEM problem, and
>>
>> No
>>
>>> (B) take an amount of time that increases with the number of ranks, and
>>
>> See below.
>>
>>> (C) take an amount of memory on rank 0 that does not decrease with the number of ranks?
>>
>> The problem here is that a serial mesh is being partitioned and sent to all processes. This is fundamentally non-scalable, but it is easy and works well for modest clusters, < 100 nodes or so. Above this, it will take increasing amounts of time. There are a few techniques for mitigating this.
>
> Is this one-to-all communication only done once? If yes, one MPI_Scatterv() is enough and should not cost much.
>
>> a) For simple domains, you can distribute a coarse grid, then regularly refine that in parallel with DMRefine() or -dm_refine <k>. These steps can be repeated easily, and redistribution in parallel is fast, as shown for example in [1].
>>
>> b) For complex meshes, you can read them in parallel, and then repeat a). This is done in [1]. It is a little more involved, but not much.
>>
>> c) You can do a multilevel partitioning, as they do in [2]. I cannot find the paper in which they describe this right now. It is feasible, but definitely the most expert approach.
>>
>> Does this make sense?
>>
>>   Thanks,
>>
>>      Matt
>>
>> [1] Fully Parallel Mesh I/O using PETSc DMPlex with an Application to Waveform Modeling, Hapla et al., https://arxiv.org/abs/2004.08729
>> [2] On the robustness and performance of entropy stable discontinuous collocation methods for the compressible Navier-Stokes equations, Rojas et al., https://arxiv.org/abs/1911.10966
>>
>>> The attached plots suggest (A), (B), and (C) are happening for the Cahn-Hilliard problem (from the firedrake-bench repo) on a 2D 8Kx8K unit-square mesh. The implementation is here [1]. Versions: Firedrake, PyOp2: 20200204.0; PETSc 3.13.1; ParMETIS 4.0.3.
>>>
>>> Two questions, one on (A) and the other on (B)+(C):
>>>
>>> 1. Is result (A) expected? Given (A), any effort to improve the quality of the compiled assembly kernels (or anything else other than mesh distribution) appears futile, since they account for only 1% of end-to-end execution time -- or am I missing something?
>>>
>>> 1a. Is mesh distribution fundamentally necessary for any FEM framework, or is it only needed by Firedrake? If the latter, how do other frameworks partition the mesh and execute in parallel with MPI while avoiding the non-scalable mesh distribution step?
>>>
>>> 2. Results (B) and (C) suggest that the mesh distribution step does not scale. Is it a fundamental property of the mesh distribution problem that it has a central bottleneck in the master process, or is it a limitation of the current implementation in PETSc-DMPlex?
>>>
>>> 2a. Our (B) result seems to agree with Figure 4 (left) of [2]. Figure 6 of [2] suggests a way to reduce the time spent on the sequential bottleneck via "parallel mesh refinement" that creates high-resolution meshes from an initial coarse mesh. Is this approach implemented in DMPlex? If so, any pointers on how to try it out with Firedrake? If not, any other directions for reducing this bottleneck?
>>>
>>> 2b. Figure 6 in [3] shows plots for the Assembly and Solve steps that scale well up to 96 cores -- is mesh distribution included in those times? Is anyone reading this aware of any other publications with evaluations of Firedrake that measure mesh distribution (or explain how to avoid or exclude it)?
>>>
>>> Thank you for your time and any info or tips.
>>>
>>> [1] https://github.com/ISI-apex/firedrake-bench/blob/master/cahn_hilliard/firedrake_cahn_hilliard_problem.py
>>> [2] Unstructured Overlapping Mesh Distribution in Parallel, Matthew G. Knepley, Michael Lange, Gerard J. Gorman, 2015. https://arxiv.org/pdf/1506.06194.pdf
>>> [3] Efficient mesh management in Firedrake using PETSc-DMPlex, Michael Lange, Lawrence Mitchell, Matthew G. Knepley, Gerard J. Gorman, SISC 38(5), S143-S155, 2016. http://arxiv.org/abs/1506.07749
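To make the MPI_Scatterv() point above concrete, here is a self-contained sketch (not from the thread; the cell array, counts, and sizes are hypothetical stand-ins) of the one-to-all pattern being discussed: rank 0 holds a partitioned cell array and ships each rank its slice with a single collective call. Note that DMPlexDistribute() does considerably more than this (partitioning, migrating topology, coordinates, and labels), so this understates the real cost.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
      int rank, size;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      const int ncells = 1000;  /* hypothetical global cell count */
      int *cells = NULL, *counts = NULL, *displs = NULL;
      if (rank == 0) {
        cells  = malloc(ncells * sizeof(int));
        counts = malloc(size * sizeof(int));
        displs = malloc(size * sizeof(int));
        for (int c = 0; c < ncells; ++c) cells[c] = c;  /* stand-in for per-cell data */
        for (int r = 0; r < size; ++r) {                /* near-even partition */
          counts[r] = ncells / size + (r < ncells % size ? 1 : 0);
          displs[r] = r ? displs[r-1] + counts[r-1] : 0;
        }
      }
      /* Each rank learns its share, then one Scatterv moves all the data */
      int mycount;
      MPI_Scatter(counts, 1, MPI_INT, &mycount, 1, MPI_INT, 0, MPI_COMM_WORLD);
      int *mycells = malloc(mycount * sizeof(int));
      MPI_Scatterv(cells, counts, displs, MPI_INT,
                   mycells, mycount, MPI_INT, 0, MPI_COMM_WORLD);
      printf("rank %d received %d cells\n", rank, mycount);
      free(mycells);
      if (rank == 0) { free(cells); free(counts); free(displs); }
      MPI_Finalize();
      return 0;
    }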
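And here is a minimal sketch of approach a) from Matt's list, written against the PETSc 3.13-era C API that the thread mentions (DMPlexCreateBoxMesh() gained extra arguments in later releases). The coarse mesh size and the refinement count are illustrative choices; only DMPlexDistribute(), DMRefine(), and -dm_refine come from the discussion above.

    #include <petscdmplex.h>

    int main(int argc, char **argv)
    {
      DM             dm, dmAux;
      PetscInt       faces[2]  = {8, 8};  /* tiny coarse mesh: cheap to build and ship */
      PetscInt       r, nrefine = 4;      /* 4 parallel refinements -> 128x128 effective */
      PetscErrorCode ierr;

      ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
      /* Coarse 2D simplex mesh of the unit square */
      ierr = DMPlexCreateBoxMesh(PETSC_COMM_WORLD, 2, PETSC_TRUE, faces,
                                 NULL, NULL, NULL, PETSC_TRUE, &dm);CHKERRQ(ierr);
      /* The one-to-all step happens here, but only for the tiny coarse mesh */
      ierr = DMPlexDistribute(dm, 0, NULL, &dmAux);CHKERRQ(ierr);
      if (dmAux) {ierr = DMDestroy(&dm);CHKERRQ(ierr); dm = dmAux;}
      /* Regular refinement now runs in parallel; -dm_refine <k> plus
         DMSetFromOptions() is the command-line equivalent */
      for (r = 0; r < nrefine; ++r) {
        ierr = DMRefine(dm, PETSC_COMM_WORLD, &dmAux);CHKERRQ(ierr);
        if (dmAux) {ierr = DMDestroy(&dm);CHKERRQ(ierr); dm = dmAux;}
      }
      ierr = DMViewFromOptions(dm, NULL, "-dm_view");CHKERRQ(ierr);
      ierr = DMDestroy(&dm);CHKERRQ(ierr);
      ierr = PetscFinalize();
      return ierr;
    }

The point of this ordering is that the only serial bottleneck touches a mesh whose size is independent of the target resolution; everything after DMPlexDistribute() runs in parallel.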
