On Sun, Jan 8, 2023 at 2:44 PM Matthew Knepley <knep...@gmail.com> wrote:
> On Sun, Jan 8, 2023 at 9:28 AM Barry Smith <bsm...@petsc.dev> wrote:
>
>> Mark,
>>
>> Looks like the error checking in PetscCommDuplicate() is doing its job.
>> It is reporting an attempt to use a PETSc object constructor on a subset
>> of ranks of an MPI_Comm (which is, of course, fundamentally impossible in
>> the PETSc/MPI model).
>>
>> Note that nroots can be negative on a particular rank, but
>> DMPlexLabelComplete_Internal() is collective on sf based on the comment
>> in the code below:
>>
>> struct _p_PetscSF {
>>   ....
>>   PetscInt nroots; /* Number of root vertices on current process
>>                       (candidates for incoming edges) */
>>
>> But the next routine calls a collective only when nroots >= 0:
>>
>> static PetscErrorCode DMPlexLabelComplete_Internal(DM dm, DMLabel label, PetscBool completeCells)
>> {
>>   ...
>>   PetscCall(PetscSFGetGraph(sfPoint, &nroots, NULL, NULL, NULL));
>>   if (nroots >= 0) {
>>     DMLabel         lblRoots, lblLeaves;
>>     IS              valueIS, pointIS;
>>     const PetscInt *values;
>>     PetscInt        numValues, v;
>>
>>     /* Pull point contributions from remote leaves into local roots */
>>     PetscCall(DMLabelGather(label, sfPoint, &lblLeaves));
>>
>> The code is four years old. How come this problem of calling the
>> constructor on a subset of ranks hasn't come up since day 1?
>>
>
> The contract here is that it should be impossible to have nroots < 0
> (meaning the SF is not set up) on a subset of processes. Do we know that
> this is happening?
>

I can't imagine a code bug here; the code is very simple. It does use GAMG
as the coarse grid solver in a pretty extreme way: GAMG is fairly
complicated and is not normally used on such small problems with this much
parallelism. It is conceivable that this is a GAMG bug, but that is not
what was going on in my initial email here. Here is a run that timed out,
but it should not have, so I think this is the same issue. I always have
perfectly distributed grids like this.
DM Object: box 2048 MPI processes
  type: plex
box in 2 dimensions:
  Min/Max of 0-cells per rank: 8385/8580
  Min/Max of 1-cells per rank: 24768/24960
  Min/Max of 2-cells per rank: 16384/16384
Labels:
  celltype: 3 strata with value/size (1 (24768), 3 (16384), 0 (8385))
  depth: 3 strata with value/size (0 (8385), 1 (24768), 2 (16384))
  marker: 1 strata with value/size (1 (385))
  Face Sets: 1 strata with value/size (1 (381))
Defined by transform from:
DM_0x84000002_1 in 2 dimensions:
  Min/Max of 0-cells per rank: 2145/2244
  Min/Max of 1-cells per rank: 6240/6336
  Min/Max of 2-cells per rank: 4096/4096
Labels:
  celltype: 3 strata with value/size (1 (6240), 3 (4096), 0 (2145))
  depth: 3 strata with value/size (0 (2145), 1 (6240), 2 (4096))
  marker: 1 strata with value/size (1 (193))
  Face Sets: 1 strata with value/size (1 (189))
Defined by transform from:
DM_0x84000002_2 in 2 dimensions:
  Min/Max of 0-cells per rank: 561/612
  Min/Max of 1-cells per rank: 1584/1632
  Min/Max of 2-cells per rank: 1024/1024
Labels:
  celltype: 3 strata with value/size (1 (1584), 3 (1024), 0 (561))
  depth: 3 strata with value/size (0 (561), 1 (1584), 2 (1024))
  marker: 1 strata with value/size (1 (97))
  Face Sets: 1 strata with value/size (1 (93))
Defined by transform from:
DM_0x84000002_3 in 2 dimensions:
  Min/Max of 0-cells per rank: 153/180
  Min/Max of 1-cells per rank: 408/432
  Min/Max of 2-cells per rank: 256/256
Labels:
  celltype: 3 strata with value/size (1 (408), 3 (256), 0 (153))
  depth: 3 strata with value/size (0 (153), 1 (408), 2 (256))
  marker: 1 strata with value/size (1 (49))
  Face Sets: 1 strata with value/size (1 (45))
Defined by transform from:
DM_0x84000002_4 in 2 dimensions:
  Min/Max of 0-cells per rank: 45/60
  Min/Max of 1-cells per rank: 108/120
  Min/Max of 2-cells per rank: 64/64
Labels:
  celltype: 3 strata with value/size (1 (108), 3 (64), 0 (45))
  depth: 3 strata with value/size (0 (45), 1 (108), 2 (64))
  marker: 1 strata with value/size (1 (25))
  Face Sets: 1 strata with value/size (1 (21))
Defined by transform from:
DM_0x84000002_5 in 2 dimensions:
  Min/Max of 0-cells per rank: 15/24
  Min/Max of 1-cells per rank: 30/36
  Min/Max of 2-cells per rank: 16/16
Labels:
  celltype: 3 strata with value/size (1 (30), 3 (16), 0 (15))
  depth: 3 strata with value/size (0 (15), 1 (30), 2 (16))
  marker: 1 strata with value/size (1 (13))
  Face Sets: 1 strata with value/size (1 (9))
Defined by transform from:
DM_0x84000002_6 in 2 dimensions:
  Min/Max of 0-cells per rank: 6/12
  Min/Max of 1-cells per rank: 9/12
  Min/Max of 2-cells per rank: 4/4
Labels:
  depth: 3 strata with value/size (0 (6), 1 (9), 2 (4))
  celltype: 3 strata with value/size (0 (6), 1 (9), 3 (4))
  marker: 1 strata with value/size (1 (7))
  Face Sets: 1 strata with value/size (1 (3))
0 TS dt 0.001 time 0.
    MHD 0) time = 0, Eergy= 2.3259668003585e+00 (plot ID 0)
    0 SNES Function norm 5.415286407365e-03
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 245100.0 ON crusher002 CANCELLED AT 2023-01-08T15:32:43 DUE TO TIME LIMIT ***

> Thanks,
>
>    Matt
>
>> On Jan 8, 2023, at 12:21 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>
>> I am running on Crusher, CPU only, 64 cores per node, with Plex/PetscFE.
>> In going up to 64 nodes, something really catastrophic is happening.
>> I understand I am not using the machine the way it was intended, but I
>> just want to see if there are any options that I could try for a quick
>> fix/help.
>>
>> In a debug build I get a stack trace on many, but not all, of the 4K
>> processes.
>> Alas, I am not sure why this job was terminated, but every process that
>> I checked that had an "ERROR" had this stack:
>>
>> 11:57 main *+= crusher:/gpfs/alpine/csc314/scratch/adams/mg-m3dc1/src/data$ grep ERROR slurm-245063.out |g 3160
>> [3160]PETSC ERROR: ------------------------------------------------------------------------
>> [3160]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
>> [3160]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>> [3160]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and https://petsc.org/release/faq/
>> [3160]PETSC ERROR: --------------------- Stack Frames ------------------------------------
>> [3160]PETSC ERROR: The line numbers in the error traceback are not always exact.
>> [3160]PETSC ERROR: #1 MPI function
>> [3160]PETSC ERROR: #2 PetscCommDuplicate() at /gpfs/alpine/csc314/scratch/adams/petsc/src/sys/objects/tagm.c:248
>> [3160]PETSC ERROR: #3 PetscHeaderCreate_Private() at /gpfs/alpine/csc314/scratch/adams/petsc/src/sys/objects/inherit.c:56
>> [3160]PETSC ERROR: #4 PetscSFCreate() at /gpfs/alpine/csc314/scratch/adams/petsc/src/vec/is/sf/interface/sf.c:65
>> [3160]PETSC ERROR: #5 DMLabelGather() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/label/dmlabel.c:1932
>> [3160]PETSC ERROR: #6 DMPlexLabelComplete_Internal() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/plexsubmesh.c:177
>> [3160]PETSC ERROR: #7 DMPlexLabelComplete() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/plexsubmesh.c:227
>> [3160]PETSC ERROR: #8 DMCompleteBCLabels_Internal() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:5301
>> [3160]PETSC ERROR: #9 DMCopyDS() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:6117
>> [3160]PETSC ERROR: #10 DMCopyDisc() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:6143
>> [3160]PETSC ERROR: #11 SetupDiscretization() at /gpfs/alpine/csc314/scratch/adams/mg-m3dc1/src/mhd_2field.c:755
>>
>> Maybe the MPI is just getting overwhelmed.
>>
>> And I was able to get one run to work (one TS with beuler), but the
>> solver performance was horrendous, and I see this (attached):
>>
>> Time (sec):          1.601e+02    1.001  1.600e+02
>> VecMDot      111712 1.0 5.1684e+01 1.4 2.32e+07 12.8 0.0e+00 0.0e+00 1.1e+05 30  4  0  0 23  30  4  0  0 23   499
>> VecNorm      163478 1.0 6.6660e+01 1.2 1.51e+07 21.5 0.0e+00 0.0e+00 1.6e+05 39  2  0  0 34  39  2  0  0 34   139
>> VecNormalize 154599 1.0 6.3942e+01 1.2 2.19e+07 23.3 0.0e+00 0.0e+00 1.5e+05 38  2  0  0 32  38  2  0  0 32   189
>> etc.
>> KSPSolve          3 1.0 1.1553e+02 1.0 1.34e+09 47.1 2.8e+09 6.0e+01 2.8e+05 72 95 45 72 58  72 95 45 72 58  4772
>>
>> Any ideas would be welcome.
>> Thanks,
>> Mark
>> <cushersolve.txt>

> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/