nroots = -1 is like a collective setup_called = false flag, except that this SF is always in that not-set-up state for serial runs. I don't think that's a bug per se, though perhaps you'd like it to be conveyed differently.
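
For concreteness, a small standalone sketch (my own illustration, not code from this thread or the PETSc source) of the convention under discussion: PetscSFGetGraph() reports nroots = -1 ("not set up") until PetscSFSetGraph() has been called, which is the state a purely serial run never leaves for the point SF.

#include <petscsf.h>

int main(int argc, char **argv)
{
  PetscSF  sf;
  PetscInt nroots;

  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
  PetscCall(PetscSFCreate(PETSC_COMM_WORLD, &sf));

  /* Before PetscSFSetGraph() the SF reports nroots = -1, i.e. "not set up" */
  PetscCall(PetscSFGetGraph(sf, &nroots, NULL, NULL, NULL));
  PetscCall(PetscPrintf(PETSC_COMM_WORLD, "before SetGraph: nroots = %" PetscInt_FMT "\n", nroots));

  /* A trivial graph: one root per rank, no leaves */
  PetscCall(PetscSFSetGraph(sf, 1, 0, NULL, PETSC_COPY_VALUES, NULL, PETSC_COPY_VALUES));
  PetscCall(PetscSFGetGraph(sf, &nroots, NULL, NULL, NULL));
  PetscCall(PetscPrintf(PETSC_COMM_WORLD, "after  SetGraph: nroots = %" PetscInt_FMT "\n", nroots));

  PetscCall(PetscSFDestroy(&sf));
  PetscCall(PetscFinalize());
  return 0;
}
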
Barry Smith <bsm...@petsc.dev> writes:

> There is a bug in the routine DMPlexLabelComplete_Internal()! The code
> should definitely not have the code route around if (nroots >=0), because
> checking the nroots value to decide on the code route is simply nonsense (if
> one "knows" "by contract" that nroots is >=0, then the if () test is not
> needed).
>
> The first thing to do is to fix the bug with a PetscCheck(): remove the
> nonsensical if (nroots >=0) check and rerun your code to see what happens.
>
> Barry
>
> Yes, it is possible that in your run the nroots is always >= 0 and some MPI
> bug is causing the problem, but this doesn't change the fact that the current
> code is buggy and needs to be fixed before blaming some other bug for the
> problem.
>
>
>> On Jan 8, 2023, at 4:04 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>
>> On Sun, Jan 8, 2023 at 2:44 PM Matthew Knepley <knep...@gmail.com> wrote:
>>> On Sun, Jan 8, 2023 at 9:28 AM Barry Smith <bsm...@petsc.dev> wrote:
>>>>
>>>> Mark,
>>>>
>>>> Looks like the error checking in PetscCommDuplicate() is doing its job.
>>>> It is reporting an attempt to use a PETSc object constructor on a subset
>>>> of ranks of an MPI_Comm (which is, of course, fundamentally impossible in
>>>> the PETSc/MPI model).
>>>>
>>>> Note that nroots can be negative on a particular rank, but
>>>> DMPlexLabelComplete_Internal() is collective on sf, based on the comment in
>>>> the code below:
>>>>
>>>> struct _p_PetscSF {
>>>>   ....
>>>>   PetscInt nroots; /* Number of root vertices on current process
>>>>                       (candidates for incoming edges) */
>>>>
>>>> But the next routine calls a collective only when nroots >= 0:
>>>>
>>>> static PetscErrorCode DMPlexLabelComplete_Internal(DM dm, DMLabel label, PetscBool completeCells)
>>>> {
>>>>   ...
>>>>   PetscCall(PetscSFGetGraph(sfPoint, &nroots, NULL, NULL, NULL));
>>>>   if (nroots >= 0) {
>>>>     DMLabel         lblRoots, lblLeaves;
>>>>     IS              valueIS, pointIS;
>>>>     const PetscInt *values;
>>>>     PetscInt        numValues, v;
>>>>
>>>>     /* Pull point contributions from remote leaves into local roots */
>>>>     PetscCall(DMLabelGather(label, sfPoint, &lblLeaves));
>>>>
>>>> The code is four years old? How come this problem of calling the
>>>> constructor on a subset of ranks hasn't come up since day 1?
>>>
>>> The contract here is that it should be impossible to have nroots < 0
>>> (meaning the SF is not set up) on a subset of processes. Do we know that
>>> this is happening?
>>
>> Can't imagine a code bug here. Very simple code.
>>
>> This code does use GAMG as the coarse grid solver in a pretty extreme way.
>> GAMG is fairly complicated and not used on such small problems with high
>> parallelism.
>> It is conceivable that it's a GAMG bug, but that is not what was going on in
>> my initial email here.
>>
>> Here is a run that timed out, but it should not have, so I think this is the
>> same issue. I always have perfectly distributed grids like this.
>>
>> DM Object: box 2048 MPI processes
>>   type: plex
>> box in 2 dimensions:
>>   Min/Max of 0-cells per rank: 8385/8580
>>   Min/Max of 1-cells per rank: 24768/24960
>>   Min/Max of 2-cells per rank: 16384/16384
>> Labels:
>>   celltype: 3 strata with value/size (1 (24768), 3 (16384), 0 (8385))
>>   depth: 3 strata with value/size (0 (8385), 1 (24768), 2 (16384))
>>   marker: 1 strata with value/size (1 (385))
>>   Face Sets: 1 strata with value/size (1 (381))
>> Defined by transform from:
>> DM_0x84000002_1 in 2 dimensions:
>>   Min/Max of 0-cells per rank: 2145/2244
>>   Min/Max of 1-cells per rank: 6240/6336
>>   Min/Max of 2-cells per rank: 4096/4096
>> Labels:
>>   celltype: 3 strata with value/size (1 (6240), 3 (4096), 0 (2145))
>>   depth: 3 strata with value/size (0 (2145), 1 (6240), 2 (4096))
>>   marker: 1 strata with value/size (1 (193))
>>   Face Sets: 1 strata with value/size (1 (189))
>> Defined by transform from:
>> DM_0x84000002_2 in 2 dimensions:
>>   Min/Max of 0-cells per rank: 561/612
>>   Min/Max of 1-cells per rank: 1584/1632
>>   Min/Max of 2-cells per rank: 1024/1024
>> Labels:
>>   celltype: 3 strata with value/size (1 (1584), 3 (1024), 0 (561))
>>   depth: 3 strata with value/size (0 (561), 1 (1584), 2 (1024))
>>   marker: 1 strata with value/size (1 (97))
>>   Face Sets: 1 strata with value/size (1 (93))
>> Defined by transform from:
>> DM_0x84000002_3 in 2 dimensions:
>>   Min/Max of 0-cells per rank: 153/180
>>   Min/Max of 1-cells per rank: 408/432
>>   Min/Max of 2-cells per rank: 256/256
>> Labels:
>>   celltype: 3 strata with value/size (1 (408), 3 (256), 0 (153))
>>   depth: 3 strata with value/size (0 (153), 1 (408), 2 (256))
>>   marker: 1 strata with value/size (1 (49))
>>   Face Sets: 1 strata with value/size (1 (45))
>> Defined by transform from:
>> DM_0x84000002_4 in 2 dimensions:
>>   Min/Max of 0-cells per rank: 45/60
>>   Min/Max of 1-cells per rank: 108/120
>>   Min/Max of 2-cells per rank: 64/64
>> Labels:
>>   celltype: 3 strata with value/size (1 (108), 3 (64), 0 (45))
>>   depth: 3 strata with value/size (0 (45), 1 (108), 2 (64))
>>   marker: 1 strata with value/size (1 (25))
>>   Face Sets: 1 strata with value/size (1 (21))
>> Defined by transform from:
>> DM_0x84000002_5 in 2 dimensions:
>>   Min/Max of 0-cells per rank: 15/24
>>   Min/Max of 1-cells per rank: 30/36
>>   Min/Max of 2-cells per rank: 16/16
>> Labels:
>>   celltype: 3 strata with value/size (1 (30), 3 (16), 0 (15))
>>   depth: 3 strata with value/size (0 (15), 1 (30), 2 (16))
>>   marker: 1 strata with value/size (1 (13))
>>   Face Sets: 1 strata with value/size (1 (9))
>> Defined by transform from:
>> DM_0x84000002_6 in 2 dimensions:
>>   Min/Max of 0-cells per rank: 6/12
>>   Min/Max of 1-cells per rank: 9/12
>>   Min/Max of 2-cells per rank: 4/4
>> Labels:
>>   depth: 3 strata with value/size (0 (6), 1 (9), 2 (4))
>>   celltype: 3 strata with value/size (0 (6), 1 (9), 3 (4))
>>   marker: 1 strata with value/size (1 (7))
>>   Face Sets: 1 strata with value/size (1 (3))
>>
>> 0 TS dt 0.001 time 0.
>> MHD 0) time = 0, Eergy= 2.3259668003585e+00 (plot ID 0)
>> 0 SNES Function norm 5.415286407365e-03
>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>> slurmstepd: error: *** STEP 245100.0 ON crusher002 CANCELLED AT 2023-01-08T15:32:43 DUE TO TIME LIMIT ***
>>
>>>
>>> Thanks,
>>>
>>>    Matt
>>>
>>>>> On Jan 8, 2023, at 12:21 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>>>>
>>>>> I am running on Crusher, CPU only, 64 cores per node with Plex/PetscFE.
>>>>> In going up to 64 nodes, something really catastrophic is happening.
>>>>> I understand I am not using the machine the way it was intended, but I
>>>>> just want to see if there are any options that I could try for a quick fix/help.
>>>>>
>>>>> In a debug build I get a stack trace on many but not all of the 4K processes.
>>>>> Alas, I am not sure why this job was terminated, but every process that I
>>>>> checked that had an "ERROR" had this stack:
>>>>>
>>>>> 11:57 main *+= crusher:/gpfs/alpine/csc314/scratch/adams/mg-m3dc1/src/data$ grep ERROR slurm-245063.out |g 3160
>>>>> [3160]PETSC ERROR: ------------------------------------------------------------------------
>>>>> [3160]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
>>>>> [3160]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>>>>> [3160]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and https://petsc.org/release/faq/
>>>>> [3160]PETSC ERROR: --------------------- Stack Frames ------------------------------------
>>>>> [3160]PETSC ERROR: The line numbers in the error traceback are not always exact.
>>>>> [3160]PETSC ERROR: #1 MPI function
>>>>> [3160]PETSC ERROR: #2 PetscCommDuplicate() at /gpfs/alpine/csc314/scratch/adams/petsc/src/sys/objects/tagm.c:248
>>>>> [3160]PETSC ERROR: #3 PetscHeaderCreate_Private() at /gpfs/alpine/csc314/scratch/adams/petsc/src/sys/objects/inherit.c:56
>>>>> [3160]PETSC ERROR: #4 PetscSFCreate() at /gpfs/alpine/csc314/scratch/adams/petsc/src/vec/is/sf/interface/sf.c:65
>>>>> [3160]PETSC ERROR: #5 DMLabelGather() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/label/dmlabel.c:1932
>>>>> [3160]PETSC ERROR: #6 DMPlexLabelComplete_Internal() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/plexsubmesh.c:177
>>>>> [3160]PETSC ERROR: #7 DMPlexLabelComplete() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/plexsubmesh.c:227
>>>>> [3160]PETSC ERROR: #8 DMCompleteBCLabels_Internal() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:5301
>>>>> [3160]PETSC ERROR: #9 DMCopyDS() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:6117
>>>>> [3160]PETSC ERROR: #10 DMCopyDisc() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:6143
>>>>> [3160]PETSC ERROR: #11 SetupDiscretization() at /gpfs/alpine/csc314/scratch/adams/mg-m3dc1/src/mhd_2field.c:755
>>>>>
>>>>> Maybe the MPI is just getting overwhelmed.
>>>>>
>>>>> And I was able to get one run to work (one TS with beuler), and the
>>>>> solver performance was horrendous and I see this (attached):
>>>>>
>>>>> Time (sec): 1.601e+02 1.001 1.600e+02
>>>>> VecMDot      111712 1.0 5.1684e+01 1.4 2.32e+07 12.8 0.0e+00 0.0e+00 1.1e+05 30  4  0  0 23  30  4  0  0 23   499
>>>>> VecNorm      163478 1.0 6.6660e+01 1.2 1.51e+07 21.5 0.0e+00 0.0e+00 1.6e+05 39  2  0  0 34  39  2  0  0 34   139
>>>>> VecNormalize 154599 1.0 6.3942e+01 1.2 2.19e+07 23.3 0.0e+00 0.0e+00 1.5e+05 38  2  0  0 32  38  2  0  0 32   189
>>>>> etc,
>>>>> KSPSolve          3 1.0 1.1553e+02 1.0 1.34e+09 47.1 2.8e+09 6.0e+01 2.8e+05 72 95 45 72 58  72 95 45 72 58  4772
>>>>>
>>>>> Any ideas would be welcome,
>>>>> Thanks,
>>>>> Mark
>>>>> <cushersolve.txt>
>>>
>>> --
>>> What most experimenters take for granted before they begin their
>>> experiments is infinitely more interesting than any results to which their
>>> experiments lead.
>>> -- Norbert Wiener
>>>
>>> https://www.cse.buffalo.edu/~knepley/
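
Regarding Matt's question above, whether nroots < 0 can really be occurring on only a subset of ranks: here is a minimal diagnostic sketch one could drop into the application (for example just before SetupDiscretization()) to test that directly. The helper name and error message are mine, not an existing PETSc routine; it relies only on DMGetPointSF(), PetscSFGetGraph(), and a reduction over the DM's communicator.

#include <petscdmplex.h>

/* Hypothetical diagnostic: verify that the point SF is set up (nroots >= 0)
   on either all ranks of the DM's communicator or on none of them */
static PetscErrorCode CheckPointSFConsistency(DM dm)
{
  PetscSF     sf;
  PetscInt    nroots;
  PetscMPIInt flag, fmin, fmax;
  MPI_Comm    comm;

  PetscFunctionBegin;
  PetscCall(PetscObjectGetComm((PetscObject)dm, &comm));
  PetscCall(DMGetPointSF(dm, &sf));
  PetscCall(PetscSFGetGraph(sf, &nroots, NULL, NULL, NULL));
  flag = (nroots >= 0) ? 1 : 0;
  /* Collective agreement check: mixed answers would mean some ranks take the
     guarded collective branch in DMPlexLabelComplete_Internal() and others skip it */
  PetscCallMPI(MPI_Allreduce(&flag, &fmin, 1, MPI_INT, MPI_MIN, comm));
  PetscCallMPI(MPI_Allreduce(&flag, &fmax, 1, MPI_INT, MPI_MAX, comm));
  PetscCheck(fmin == fmax, comm, PETSC_ERR_ARG_WRONGSTATE,
             "Point SF is set up on some ranks (nroots >= 0) but not on others");
  PetscFunctionReturn(0);
}

If that check ever fires (or hangs), it would confirm that the collective-on-a-subset path is really being taken, independent of anything GAMG does later in the solve.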