nroots=-1 is like a collective setup_called=false flag, except that this SF is 
always in that not-set-up state for serial runs. I don't think that's a bug per 
se, though perhaps you'd like it to be conveyed differently.
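
As a standalone illustration of that semantics (my sketch, not code from this
thread): an SF whose graph has never been set reports nroots = -1 from
PetscSFGetGraph(), which is the state a DM's point SF stays in for the life of
a serial run.

  #include <petscsf.h>

  int main(int argc, char **argv)
  {
    PetscSF  sf;
    PetscInt nroots;

    PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
    PetscCall(PetscSFCreate(PETSC_COMM_WORLD, &sf));
    /* No PetscSFSetGraph() call, so the SF stays in the "not set up" state */
    PetscCall(PetscSFGetGraph(sf, &nroots, NULL, NULL, NULL));
    PetscCall(PetscPrintf(PETSC_COMM_WORLD, "nroots = %" PetscInt_FMT "\n", nroots)); /* prints -1 */
    PetscCall(PetscSFDestroy(&sf));
    PetscCall(PetscFinalize());
    return 0;
  }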

Barry Smith <bsm...@petsc.dev> writes:

>    There is a bug in the routine DMPlexLabelComplete_Internal()! The code 
> should definitely not choose a code route based on if (nroots >= 0), because 
> checking the nroots value to decide on the code route is simply nonsense (if 
> one "knows" "by contract" that nroots is >= 0, then the if () test is not 
> needed).
>
>    The first thing to do is to fix the bug: add a PetscCheck(), remove the 
> nonsensical if (nroots >= 0) check, and rerun your code to see what happens.
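
(To make that concrete, here is roughly how I read the proposed diagnostic; my
sketch, not a tested patch. Note that, per my point above, in serial the point
SF always has nroots = -1, so this check would fire there by design.)

    PetscCall(PetscSFGetGraph(sfPoint, &nroots, NULL, NULL, NULL));
    /* Fail loudly if the SF is not set up instead of silently skipping the
       collective branch on this rank... */
    PetscCheck(nroots >= 0, PetscObjectComm((PetscObject)dm), PETSC_ERR_ARG_WRONGSTATE,
               "Point SF is not set up (nroots = %" PetscInt_FMT ")", nroots);
    /* ...then take the formerly guarded path unconditionally */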
>
>   Barry
>
> Yes, it is possible that in your run nroots is always >= 0 and some MPI 
> bug is causing the problem, but this doesn't change the fact that the current 
> code is buggy and needs to be fixed before blaming some other bug for the 
> problem.
>
>
>
>> On Jan 8, 2023, at 4:04 PM, Mark Adams <mfad...@lbl.gov> wrote:
>> 
>> 
>> 
>> On Sun, Jan 8, 2023 at 2:44 PM Matthew Knepley <knep...@gmail.com 
>> <mailto:knep...@gmail.com>> wrote:
>>> On Sun, Jan 8, 2023 at 9:28 AM Barry Smith <bsm...@petsc.dev 
>>> <mailto:bsm...@petsc.dev>> wrote:
>>>> 
>>>>   Mark,
>>>> 
>>>>   Looks like the error checking in PetscCommDuplicate() is doing its job. 
>>>> It is reporting an attempt to use a PETSc object constructor on a subset 
>>>> of ranks of an MPI_Comm (which is, of course, fundamentally impossible in 
>>>> the PETSc/MPI model).
>>>> 
>>>> Note that nroots can be negative on a particular rank, but 
>>>> DMPlexLabelComplete_Internal() is collective on the sf, based on the 
>>>> comment in the code below:
>>>> 
>>>> 
>>>> struct _p_PetscSF {
>>>> ....
>>>>   PetscInt     nroots;  /* Number of root vertices on current process 
>>>> (candidates for incoming edges) */
>>>> 
>>>> But the next routine calls a collective only when nroots >= 0 
>>>> 
>>>> static PetscErrorCode DMPlexLabelComplete_Internal(DM dm, DMLabel label, PetscBool completeCells)
>>>> {
>>>> ...
>>>>   PetscCall(PetscSFGetGraph(sfPoint, &nroots, NULL, NULL, NULL));
>>>>   if (nroots >= 0) {
>>>>     DMLabel         lblRoots, lblLeaves;
>>>>     IS              valueIS, pointIS;
>>>>     const PetscInt *values;
>>>>     PetscInt        numValues, v;
>>>> 
>>>>     /* Pull point contributions from remote leaves into local roots */
>>>>     PetscCall(DMLabelGather(label, sfPoint, &lblLeaves));
>>>> 
>>>> 
>>>> The code is four years old? How come this problem of calling the 
>>>> constructor on a subset of ranks hasn't come up since day 1? 
>>> 
>>> The contract here is that it should be impossible to have nroots < 0 
>>> (meaning the SF is not set up) on a subset of processes. Do we know that 
>>> this is happening?
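
(One quick way to answer that directly would be a throwaway check along these
lines inside DMPlexLabelComplete_Internal(); just a sketch reusing the
routine's dm and sfPoint, with the other variable names made up here:)

    PetscInt nroots, hasGraph, minHas, maxHas;

    PetscCall(PetscSFGetGraph(sfPoint, &nroots, NULL, NULL, NULL));
    hasGraph = (nroots >= 0) ? 1 : 0;
    /* Every rank gets the same min/max, so either all ranks error or none does */
    PetscCall(MPIU_Allreduce(&hasGraph, &minHas, 1, MPIU_INT, MPI_MIN, PetscObjectComm((PetscObject)dm)));
    PetscCall(MPIU_Allreduce(&hasGraph, &maxHas, 1, MPIU_INT, MPI_MAX, PetscObjectComm((PetscObject)dm)));
    /* If min != max, the point SF is set up on only a subset of ranks: contract violated */
    PetscCheck(minHas == maxHas, PetscObjectComm((PetscObject)dm), PETSC_ERR_PLIB,
               "Point SF set up on only a subset of ranks");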
>> 
>> Can't imagine a code bug here. Very simple code.
>> 
>> This code does use GAMG as the coarse grid solver in a pretty extreme way.
>> GAMG is fairly complicated and not normally used on such small problems with 
>> this much parallelism.
>> It is conceivable that it's a GAMG bug, but that is not what was going on in 
>> my initial email here.
>> 
>> Here is a run that timed out, but it should not have, so I think this is the 
>> same issue. I always have perfectly distributed grids like this.
>> 
>> DM Object: box 2048 MPI processes
>>   type: plex
>> box in 2 dimensions:
>>   Min/Max of 0-cells per rank: 8385/8580
>>   Min/Max of 1-cells per rank: 24768/24960
>>   Min/Max of 2-cells per rank: 16384/16384
>> Labels:
>>   celltype: 3 strata with value/size (1 (24768), 3 (16384), 0 (8385))
>>   depth: 3 strata with value/size (0 (8385), 1 (24768), 2 (16384))
>>   marker: 1 strata with value/size (1 (385))
>>   Face Sets: 1 strata with value/size (1 (381))
>>   Defined by transform from:
>>   DM_0x84000002_1 in 2 dimensions:
>>     Min/Max of 0-cells per rank:   2145/2244
>>     Min/Max of 1-cells per rank:   6240/6336
>>     Min/Max of 2-cells per rank:   4096/4096
>>   Labels:
>>     celltype: 3 strata with value/size (1 (6240), 3 (4096), 0 (2145))
>>     depth: 3 strata with value/size (0 (2145), 1 (6240), 2 (4096))
>>     marker: 1 strata with value/size (1 (193))
>>     Face Sets: 1 strata with value/size (1 (189))
>>     Defined by transform from:
>>     DM_0x84000002_2 in 2 dimensions:
>>       Min/Max of 0-cells per rank:     561/612
>>       Min/Max of 1-cells per rank:     1584/1632
>>       Min/Max of 2-cells per rank:     1024/1024
>>     Labels:
>>       celltype: 3 strata with value/size (1 (1584), 3 (1024), 0 (561))
>>       depth: 3 strata with value/size (0 (561), 1 (1584), 2 (1024))
>>       marker: 1 strata with value/size (1 (97))
>>       Face Sets: 1 strata with value/size (1 (93))
>>       Defined by transform from:
>>       DM_0x84000002_3 in 2 dimensions:
>>         Min/Max of 0-cells per rank:       153/180
>>         Min/Max of 1-cells per rank:       408/432
>>         Min/Max of 2-cells per rank:       256/256
>>       Labels:
>>         celltype: 3 strata with value/size (1 (408), 3 (256), 0 (153))
>>         depth: 3 strata with value/size (0 (153), 1 (408), 2 (256))
>>         marker: 1 strata with value/size (1 (49))
>>         Face Sets: 1 strata with value/size (1 (45))
>>         Defined by transform from:
>>         DM_0x84000002_4 in 2 dimensions:
>>           Min/Max of 0-cells per rank:         45/60
>>           Min/Max of 1-cells per rank:         108/120
>>           Min/Max of 2-cells per rank:         64/64
>>         Labels:
>>           celltype: 3 strata with value/size (1 (108), 3 (64), 0 (45))
>>           depth: 3 strata with value/size (0 (45), 1 (108), 2 (64))
>>           marker: 1 strata with value/size (1 (25))
>>           Face Sets: 1 strata with value/size (1 (21))
>>           Defined by transform from:
>>           DM_0x84000002_5 in 2 dimensions:
>>             Min/Max of 0-cells per rank:           15/24
>>             Min/Max of 1-cells per rank:           30/36
>>             Min/Max of 2-cells per rank:           16/16
>>           Labels:
>>             celltype: 3 strata with value/size (1 (30), 3 (16), 0 (15))
>>             depth: 3 strata with value/size (0 (15), 1 (30), 2 (16))
>>             marker: 1 strata with value/size (1 (13))
>>             Face Sets: 1 strata with value/size (1 (9))
>>             Defined by transform from:
>>             DM_0x84000002_6 in 2 dimensions:
>>               Min/Max of 0-cells per rank:             6/12
>>               Min/Max of 1-cells per rank:             9/12
>>               Min/Max of 2-cells per rank:             4/4
>>             Labels:
>>               depth: 3 strata with value/size (0 (6), 1 (9), 2 (4))
>>               celltype: 3 strata with value/size (0 (6), 1 (9), 3 (4))
>>               marker: 1 strata with value/size (1 (7))
>>               Face Sets: 1 strata with value/size (1 (3))
>> 0 TS dt 0.001 time 0.
>> MHD    0) time =         0, Eergy=  2.3259668003585e+00 (plot ID 0)
>>     0 SNES Function norm 5.415286407365e-03
>> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
>> slurmstepd: error: *** STEP 245100.0 ON crusher002 CANCELLED AT 
>> 2023-01-08T15:32:43 DUE TO TIME LIMIT ***
>> 
>>  
>>> 
>>>   Thanks,
>>> 
>>>     Matt
>>>  
>>>>> On Jan 8, 2023, at 12:21 PM, Mark Adams <mfad...@lbl.gov 
>>>>> <mailto:mfad...@lbl.gov>> wrote:
>>>>> 
>>>>> I am running on Crusher, CPU only, 64 cores per node with Plex/PetscFE. 
>>>>> In going up to 64 nodes, something really catastrophic is happening. 
>>>>> I understand I am not using the machine the way it was intended, but I 
>>>>> just want to see if there are any options that I could try for a quick 
>>>>> fix/help.
>>>>> 
>>>>> In a debug build I get a stack trace on many but not all of the 4K 
>>>>> processes. 
>>>>> Alas, I am not sure why this job was terminated, but every process that I 
>>>>> checked that had an "ERROR" had this stack:
>>>>> 
>>>>> 11:57 main *+= 
>>>>> crusher:/gpfs/alpine/csc314/scratch/adams/mg-m3dc1/src/data$ grep ERROR 
>>>>> slurm-245063.out |g 3160
>>>>> [3160]PETSC ERROR: 
>>>>> ------------------------------------------------------------------------
>>>>> [3160]PETSC ERROR: Caught signal number 15 Terminate: Some process (or 
>>>>> the batch system) has told this process to end
>>>>> [3160]PETSC ERROR: Try option -start_in_debugger or 
>>>>> -on_error_attach_debugger
>>>>> [3160]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and 
>>>>> https://petsc.org/release/faq/
>>>>> [3160]PETSC ERROR: ---------------------  Stack Frames 
>>>>> ------------------------------------
>>>>> [3160]PETSC ERROR: The line numbers in the error traceback are not always 
>>>>> exact.
>>>>> [3160]PETSC ERROR: #1 MPI function
>>>>> [3160]PETSC ERROR: #2 PetscCommDuplicate() at 
>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/sys/objects/tagm.c:248
>>>>> [3160]PETSC ERROR: #3 PetscHeaderCreate_Private() at 
>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/sys/objects/inherit.c:56
>>>>> [3160]PETSC ERROR: #4 PetscSFCreate() at 
>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/vec/is/sf/interface/sf.c:65
>>>>> [3160]PETSC ERROR: #5 DMLabelGather() at 
>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/label/dmlabel.c:1932
>>>>> [3160]PETSC ERROR: #6 DMPlexLabelComplete_Internal() at 
>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/plexsubmesh.c:177
>>>>> [3160]PETSC ERROR: #7 DMPlexLabelComplete() at 
>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/plexsubmesh.c:227
>>>>> [3160]PETSC ERROR: #8 DMCompleteBCLabels_Internal() at 
>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:5301
>>>>> [3160]PETSC ERROR: #9 DMCopyDS() at 
>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:6117
>>>>> [3160]PETSC ERROR: #10 DMCopyDisc() at 
>>>>> /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:6143
>>>>> [3160]PETSC ERROR: #11 SetupDiscretization() at 
>>>>> /gpfs/alpine/csc314/scratch/adams/mg-m3dc1/src/mhd_2field.c:755
>>>>> 
>>>>> Maybe the MPI is just getting overwhelmed. 
>>>>> 
>>>>> And I was able to get one run to work (one TS with beuler), but the 
>>>>> solver performance was horrendous and I see this (attached):
>>>>> 
>>>>> Time (sec):           1.601e+02     1.001   1.600e+02
>>>>> VecMDot           111712 1.0 5.1684e+01 1.4 2.32e+07 12.8 0.0e+00 0.0e+00 
>>>>> 1.1e+05 30  4  0  0 23  30  4  0  0 23   499
>>>>> VecNorm           163478 1.0 6.6660e+01 1.2 1.51e+07 21.5 0.0e+00 0.0e+00 
>>>>> 1.6e+05 39  2  0  0 34  39  2  0  0 34   139
>>>>> VecNormalize      154599 1.0 6.3942e+01 1.2 2.19e+07 23.3 0.0e+00 0.0e+00 
>>>>> 1.5e+05 38  2  0  0 32  38  2  0  0 32   189
>>>>> etc,
>>>>> KSPSolve               3 1.0 1.1553e+02 1.0 1.34e+09 47.1 2.8e+09 6.0e+01 
>>>>> 2.8e+05 72 95 45 72 58  72 95 45 72 58  4772
>>>>> 
>>>>> Any ideas would be welcome,
>>>>> Thanks,
>>>>> Mark
>>>>> <cushersolve.txt>
>>>> 
>>> 
>>> 
>>> -- 
>>> What most experimenters take for granted before they begin their 
>>> experiments is infinitely more interesting than any results to which their 
>>> experiments lead.
>>> -- Norbert Wiener
>>> 
>>> https://www.cse.buffalo.edu/~knepley/ <http://www.cse.buffalo.edu/~knepley/>
