> [2] On the robustness and performance of entropy stable discontinuous
> collocation methods for the compressible Navier-Stokes equations, Rojas
> et al.
> https://arxiv.org/abs/1911.10966
>
This is not the proper reference; here is the correct one:
https://www.sciencedirect.com/science/article/pii/S0021999120306185?dgcid=rss_sd_all

However, the algorithm is only outlined there, and performance related
to mesh distribution is not really reported. We observed a large gain
(from minutes to seconds) for one-to-all distributions at large core
counts by splitting the several communication rounds needed by DMPlex
into stages: first from rank 0 to one rank per node, and then
decomposing independently within each node.

Attached is the total time for a one-to-all DMPlexDistribute for a
128^3 mesh.

>
>
>> ?
>>
>> The attached plots suggest (A), (B), and (C) are happening for the
>> Cahn-Hilliard problem (from firedrake-bench repo) on a 2D 8Kx8K
>> unit-square mesh. The implementation is here [1]. Versions are
>> Firedrake, PyOp2: 20200204.0; PETSc 3.13.1; ParMETIS 4.0.3.
>>
>> Two questions, one on (A) and the other on (B)+(C):
>>
>> 1. Is the (A) result expected? Given (A), any effort to improve the quality
>> of the compiled assembly kernels (or anything else other than mesh
>> distribution) appears futile since it takes 1% of end-to-end execution
>> time, or am I missing something?
>>
>> 1a. Is mesh distribution fundamentally necessary for any FEM framework,
>> or is it only needed by Firedrake? If the latter, then how do other
>> frameworks partition the mesh and execute in parallel with MPI but avoid
>> the non-scalable mesh distribution step?
>>
>> 2. Results (B) and (C) suggest that the mesh distribution step does
>> not scale. Is it a fundamental property of the mesh distribution problem
>> that it has a central bottleneck in the master process, or is it
>> a limitation of the current implementation in PETSc-DMPlex?
>>
>> 2a. Our (B) result seems to agree with Figure 4(left) of [2]. Fig 6 of [2]
>> suggests a way to reduce the time spent on the sequential bottleneck by
>> "parallel mesh refinement" that creates high-resolution meshes from an
>> initial coarse mesh.
>> Is this approach implemented in DMPlex? If so, any
>> pointers on how to try it out with Firedrake? If not, any other
>> directions for reducing this bottleneck?
>>
>> 2b. Fig 6 in [3] shows plots for Assembly and Solve steps that scale well
>> up to 96 cores -- is mesh distribution included in those times? Is anyone
>> reading this aware of any other publications with evaluations of
>> Firedrake that measure mesh distribution (or explain how to avoid or
>> exclude it)?
>>
>> Thank you for your time and any info or tips.
>>
>>
>> [1]
>> https://github.com/ISI-apex/firedrake-bench/blob/master/cahn_hilliard/firedrake_cahn_hilliard_problem.py
>>
>> [2] Unstructured Overlapping Mesh Distribution in Parallel, Matthew G.
>> Knepley, Michael Lange, Gerard J. Gorman, 2015.
>> https://arxiv.org/pdf/1506.06194.pdf
>>
>> [3] Efficient mesh management in Firedrake using PETSc-DMPlex, Michael
>> Lange, Lawrence Mitchell, Matthew G. Knepley and Gerard J. Gorman, SISC,
>> 38(5), S143-S155, 2016. http://arxiv.org/abs/1506.07749
>>
>

--
Stefano
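
As a rough illustration of the staged one-to-all distribution described in the reply above (rank 0 to one rank per node, then independently within each node), here is a minimal pure-Python sketch. It is not the DMPlex implementation and makes no MPI calls. The functions `staged_plan` and `root_fanout` and the 4-node, 8-ranks-per-node layout are hypothetical, chosen only to show how staging reduces the number of ranks the root has to address directly. In an actual MPI code, a node-local communicator could be obtained with `MPI_Comm_split_type` and `MPI_COMM_TYPE_SHARED`; that is an assumption about one possible approach, not a statement of what was done here.

```python
from collections import defaultdict


def staged_plan(rank_to_node, root=0):
    """Plan a two-stage one-to-all distribution (illustrative only).

    rank_to_node: list where entry r is the (hypothetical) node id hosting rank r.
    Returns (leaders, groups): leaders maps node id -> leader rank,
    groups maps leader rank -> list of the other ranks on that node.
    """
    ranks_on_node = defaultdict(list)
    for rank, node in enumerate(rank_to_node):
        ranks_on_node[node].append(rank)

    leaders = {}
    groups = {}
    for node, ranks in ranks_on_node.items():
        # The root leads its own node; elsewhere pick the lowest rank.
        leader = root if root in ranks else min(ranks)
        leaders[node] = leader
        groups[leader] = [r for r in ranks if r != leader]
    return leaders, groups


def root_fanout(rank_to_node, root=0):
    """Number of ranks the root must send to: (flat, staged)."""
    leaders, groups = staged_plan(rank_to_node, root)
    # Flat one-to-all: the root addresses every other rank.
    flat = len(rank_to_node) - 1
    # Staged: the other node leaders, plus the root's own node-local ranks.
    staged = (len(leaders) - 1) + len(groups[root])
    return flat, staged


if __name__ == "__main__":
    # Hypothetical layout: 4 nodes with 8 ranks each, ranks placed in blocks.
    layout = [r // 8 for r in range(32)]
    leaders, groups = staged_plan(layout)
    print("node leaders:", sorted(leaders.values()))
    print("root fan-out (flat, staged):", root_fanout(layout))
```

In this toy layout the root addresses 10 ranks instead of 31, and each node leader then serves only its own 7 node-local ranks, independently of the other nodes. The sketch only counts destinations; it says nothing about message sizes or the actual timings reported in the attachment.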
