Whoops, snes/tests/ex13.c. This is what I used for the Summit runs that I presented a while ago.
On Sun, Mar 7, 2021 at 6:12 AM Barry Smith <[email protected]> wrote:

>    mat/tests/ex13.c creates a sequential AIJ matrix, converts it to the
> same format, reorders it, and then prints it and the reordering in ASCII.
> Each of these steps is sequential and takes place on each rank. The prints
> are ASCII stdout on the ranks.
>
>   ierr = MatCreateSeqAIJ(PETSC_COMM_SELF,m*n,m*n,5,NULL,&C);CHKERRQ(ierr);
>   /* create the matrix for the five point stencil, YET AGAIN */
>   for (i=0; i<m; i++) {
>     for (j=0; j<n; j++) {
>       v = -1.0; Ii = j + n*i;
>       if (i>0)   {J = Ii - n; ierr = MatSetValues(C,1,&Ii,1,&J,&v,INSERT_VALUES);CHKERRQ(ierr);}
>       if (i<m-1) {J = Ii + n; ierr = MatSetValues(C,1,&Ii,1,&J,&v,INSERT_VALUES);CHKERRQ(ierr);}
>       if (j>0)   {J = Ii - 1; ierr = MatSetValues(C,1,&Ii,1,&J,&v,INSERT_VALUES);CHKERRQ(ierr);}
>       if (j<n-1) {J = Ii + 1; ierr = MatSetValues(C,1,&Ii,1,&J,&v,INSERT_VALUES);CHKERRQ(ierr);}
>       v = 4.0; ierr = MatSetValues(C,1,&Ii,1,&Ii,&v,INSERT_VALUES);CHKERRQ(ierr);
>     }
>   }
>   ierr = MatAssemblyBegin(C,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
>   ierr = MatAssemblyEnd(C,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
>
>   ierr = MatConvert(C,MATSAME,MAT_INITIAL_MATRIX,&A);CHKERRQ(ierr);
>
>   ierr = MatGetOrdering(A,MATORDERINGND,&perm,&iperm);CHKERRQ(ierr);
>   ierr = ISView(perm,PETSC_VIEWER_STDOUT_SELF);CHKERRQ(ierr);
>   ierr = ISView(iperm,PETSC_VIEWER_STDOUT_SELF);CHKERRQ(ierr);
>   ierr = MatView(A,PETSC_VIEWER_STDOUT_SELF);CHKERRQ(ierr);
>
> I think each rank would simply be running the same code and dumping
> everything to its own stdout.
>
> At some point within the system/MPI executor there is code that merges and
> prints out the stdout of each rank. If the test truly takes 45 minutes,
> then Fugaku has a classic bug of not being able to efficiently merge stdout
> from each of the ranks. Nothing really to do with PETSc, just neglect by the
> Fugaku developers of one aspect of developing an HPC system. Heck, they only
> had a billion dollars, can't expect them to do what other scalable systems
> do :-).
>
> One should be able to reproduce this with a simple MPI program that prints
> a moderate amount of data to stdout on each rank.
>
>   Barry
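A minimal reproducer along the lines Barry suggests might look like the sketch below. It is illustrative only: the per-rank line count and the message text are arbitrary choices, not taken from this thread.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  int rank, i;
  const int lines = 1000;  /* arbitrary "moderate" amount of per-rank output */

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* Each rank writes its own stream to stdout; the job launcher has to
     merge these streams, which is where the suspected bottleneck lives. */
  for (i = 0; i < lines; i++) {
    printf("[rank %d] line %d: some moderately long piece of output text\n", rank, i);
  }
  fflush(stdout);

  MPI_Finalize();
  return 0;
}

If this alone is slow at scale, that points at the launcher's stdout merging rather than at PETSc.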
> On Mar 6, 2021, at 9:46 PM, Mark Adams <[email protected]> wrote:
>
> I observed poor scaling with mat/tests/ex13 on Fugaku recently.
> I was running this test as is (e.g., no threads and 4 MPI processes per
> node/chip, which seems recommended). I did not dig into this.
> A test with about 10% of the machine took about 45 minutes to run.
>
> Mark
>
> On Sat, Mar 6, 2021 at 9:49 PM Junchao Zhang <[email protected]> wrote:
>
>> On Sat, Mar 6, 2021 at 12:27 PM Matthew Knepley <[email protected]> wrote:
>>
>>> On Fri, Mar 5, 2021 at 4:06 PM Alexei Colin <[email protected]> wrote:
>>>
>>>> To PETSc DMPlex users, Firedrake users, Dr. Knepley and Dr. Karpeev:
>>>>
>>>> Is it expected for the mesh distribution step to
>>>> (A) take a share of 50-99% of total time-to-solution of an FEM problem, and
>>>
>>> No.
>>>
>>>> (B) take an amount of time that increases with the number of ranks, and
>>>
>>> See below.
>>>
>>>> (C) take an amount of memory on rank 0 that does not decrease with the
>>>> number of ranks?
>>>
>>> The problem here is that a serial mesh is being partitioned and sent to
>>> all processes. This is fundamentally non-scalable, but it is easy and
>>> works well for modest clusters of < 100 nodes or so. Above this, it will
>>> take increasing amounts of time. There are a few techniques for
>>> mitigating this.
>>
>> Is this one-to-all communication only done once? If yes, one
>> MPI_Scatterv() is enough and should not cost much.
>>
>>> a) For simple domains, you can distribute a coarse grid, then regularly
>>> refine that in parallel with DMRefine() or -dm_refine <k>.
>>> These steps can be repeated easily, and redistribution in parallel
>>> is fast, as shown for example in [1].
>>>
>>> b) For complex meshes, you can read them in parallel, and then repeat
>>> a). This is done in [1]. It is a little more involved, but not much.
>>>
>>> c) You can do a multilevel partitioning, as they do in [2]. I cannot
>>> find the paper in which they describe this right now. It is feasible,
>>> but definitely the most expert approach.
>>>
>>> Does this make sense?
>>>
>>> Thanks,
>>>
>>>   Matt
>>>
>>> [1] Fully Parallel Mesh I/O using PETSc DMPlex with an Application to
>>> Waveform Modeling, Hapla et al. https://arxiv.org/abs/2004.08729
>>> [2] On the robustness and performance of entropy stable discontinuous
>>> collocation methods for the compressible Navier-Stokes equations,
>>> Rojas et al. https://arxiv.org/abs/1911.10966
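A minimal sketch of option (a), assuming a PETSc release of roughly the vintage discussed in this thread (the DMPlexCreateBoxMesh argument list differs in newer releases) and using the same ierr/CHKERRQ style as the excerpt above:

#include <petscdmplex.h>

int main(int argc, char **argv)
{
  DM             dm, dmDist = NULL;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc,&argv,NULL,NULL); if (ierr) return ierr;
  /* Build a small coarse box mesh; cheap to create and to partition serially */
  ierr = DMPlexCreateBoxMesh(PETSC_COMM_WORLD,2,PETSC_TRUE,NULL,NULL,NULL,NULL,PETSC_TRUE,&dm);CHKERRQ(ierr);
  /* Distribute the coarse mesh; only this small mesh crosses the network */
  ierr = DMPlexDistribute(dm,0,NULL,&dmDist);CHKERRQ(ierr);
  if (dmDist) { ierr = DMDestroy(&dm);CHKERRQ(ierr); dm = dmDist; }
  /* Refine in parallel, e.g. with -dm_refine 6; DMRefine() could be called
     directly instead */
  ierr = DMSetFromOptions(dm);CHKERRQ(ierr);
  ierr = DMViewFromOptions(dm,NULL,"-dm_view");CHKERRQ(ierr);
  ierr = DMDestroy(&dm);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}

Run with, for example, mpiexec -n 4 ./ex -dm_refine 6 -dm_view: only the coarse box mesh is partitioned serially, and each refinement is applied to the already-distributed mesh in parallel.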
>>>> The attached plots suggest (A), (B), and (C) are happening for a
>>>> Cahn-Hilliard problem (from the firedrake-bench repo) on a 2D 8Kx8K
>>>> unit-square mesh. The implementation is here [1]. Versions are
>>>> Firedrake, PyOp2: 20200204.0; PETSc 3.13.1; ParMETIS 4.0.3.
>>>>
>>>> Two questions, one on (A) and the other on (B)+(C):
>>>>
>>>> 1. Is the (A) result expected? Given (A), any effort to improve the quality
>>>> of the compiled assembly kernels (or anything else other than mesh
>>>> distribution) appears futile, since that takes only ~1% of end-to-end
>>>> execution time, or am I missing something?
>>>>
>>>> 1a. Is mesh distribution fundamentally necessary for any FEM framework,
>>>> or is it only needed by Firedrake? If the latter, then how do other
>>>> frameworks partition the mesh and execute in parallel with MPI but avoid
>>>> the non-scalable mesh distribution step?
>>>>
>>>> 2. Results (B) and (C) suggest that the mesh distribution step does
>>>> not scale. Is it a fundamental property of the mesh distribution problem
>>>> that it has a central bottleneck in the master process, or is it
>>>> a limitation of the current implementation in PETSc-DMPlex?
>>>>
>>>> 2a. Our (B) result seems to agree with Figure 4 (left) of [2]. Figure 6 of
>>>> [2] suggests a way to reduce the time spent on the sequential bottleneck by
>>>> "parallel mesh refinement" that creates high-resolution meshes from an
>>>> initial coarse mesh. Is this approach implemented in DMPlex? If so, any
>>>> pointers on how to try it out with Firedrake? If not, any other
>>>> directions for reducing this bottleneck?
>>>>
>>>> 2b. Figure 6 in [3] shows plots for Assembly and Solve steps that scale
>>>> well up to 96 cores -- is mesh distribution included in those times? Is
>>>> anyone reading this aware of any other publications with evaluations of
>>>> Firedrake that measure mesh distribution (or explain how to avoid or
>>>> exclude it)?
>>>>
>>>> Thank you for your time and any info or tips.
>>>>
>>>> [1] https://github.com/ISI-apex/firedrake-bench/blob/master/cahn_hilliard/firedrake_cahn_hilliard_problem.py
>>>>
>>>> [2] Unstructured Overlapping Mesh Distribution in Parallel, Matthew G.
>>>> Knepley, Michael Lange, Gerard J. Gorman, 2015.
>>>> https://arxiv.org/pdf/1506.06194.pdf
>>>>
>>>> [3] Efficient mesh management in Firedrake using PETSc-DMPlex, Michael
>>>> Lange, Lawrence Mitchell, Matthew G. Knepley and Gerard J. Gorman, SISC,
>>>> 38(5), S143-S155, 2016. http://arxiv.org/abs/1506.07749
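As an aside on Junchao's MPI_Scatterv() remark above: a single scatter of variable-sized chunks from rank 0 is straightforward to express. The sketch below is illustrative only; the chunk sizes and payload are invented.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Rank 0 owns a partitioned array and sends one variable-sized chunk to
   each rank with a single MPI_Scatterv(). */
int main(int argc, char **argv)
{
  int     rank, size, i, mycount;
  int    *counts = NULL, *displs = NULL;
  double *sendbuf = NULL, *recvbuf;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  if (!rank) {
    int total = 0;
    counts = malloc(size * sizeof(int));
    displs = malloc(size * sizeof(int));
    for (i = 0; i < size; i++) {   /* deliberately uneven chunk sizes */
      counts[i] = 100 + i;
      displs[i] = total;
      total    += counts[i];
    }
    sendbuf = malloc(total * sizeof(double));
    for (i = 0; i < total; i++) sendbuf[i] = (double)i;
  }

  /* Every rank must know its own receive count (here recomputed locally) */
  mycount = 100 + rank;
  recvbuf = malloc(mycount * sizeof(double));

  MPI_Scatterv(sendbuf, counts, displs, MPI_DOUBLE,
               recvbuf, mycount, MPI_DOUBLE, 0, MPI_COMM_WORLD);

  printf("[rank %d] received %d values, first = %g\n", rank, mycount, recvbuf[0]);

  free(recvbuf);
  if (!rank) { free(counts); free(displs); free(sendbuf); }
  MPI_Finalize();
  return 0;
}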
