On Sun, Jan 8, 2023 at 2:44 PM Matthew Knepley <knep...@gmail.com> wrote:
> On Sun, Jan 8, 2023 at 9:28 AM Barry Smith <bsm...@petsc.dev> wrote:
>
>> Mark,
>>
>> Looks like the error checking in PetscCommDuplicate() is doing its job.
>> It is reporting an attempt to use a PETSc object constructor on a subset
>> of ranks of an MPI_Comm (which is, of course, fundamentally impossible in
>> the PETSc/MPI model).
>>
>> Note that nroots can be negative on a particular rank, but
>> DMPlexLabelComplete_Internal() is collective on sf based on the comment
>> in the code below:
>>
>> struct _p_PetscSF {
>>   ....
>>   PetscInt nroots; /* Number of root vertices on current process
>>                       (candidates for incoming edges) */
>>
>> But the next routine calls a collective only when nroots >= 0:
>>
>> static PetscErrorCode DMPlexLabelComplete_Internal(DM dm, DMLabel label, PetscBool completeCells)
>> {
>>   ...
>>   PetscCall(PetscSFGetGraph(sfPoint, &nroots, NULL, NULL, NULL));
>>   if (nroots >= 0) {
>>     DMLabel         lblRoots, lblLeaves;
>>     IS              valueIS, pointIS;
>>     const PetscInt *values;
>>     PetscInt        numValues, v;
>>
>>     /* Pull point contributions from remote leaves into local roots */
>>     PetscCall(DMLabelGather(label, sfPoint, &lblLeaves));
>>
>> The code is four years old. How come this problem of calling the
>> constructor on a subset of ranks hasn't come up since day 1?
>>
>
> The contract here is that it should be impossible to have nroots < 0
> (meaning the SF is not set up) on a subset of processes. Do we know that
> this is happening?
>

I can't imagine a code bug here; the code is very simple. It does use GAMG
as the coarse grid solver in a pretty extreme way: GAMG is fairly
complicated and is not normally used on such small problems with this much
parallelism. It is conceivable that this is a GAMG bug, but that is not
what was going on in my initial email here. Here is a run that timed out,
but it should not have, so I think this is the same issue. I always have
perfectly distributed grids like this.
DM Object: box 2048 MPI processes
  type: plex
box in 2 dimensions:
  Min/Max of 0-cells per rank: 8385/8580
  Min/Max of 1-cells per rank: 24768/24960
  Min/Max of 2-cells per rank: 16384/16384
Labels:
  celltype: 3 strata with value/size (1 (24768), 3 (16384), 0 (8385))
  depth: 3 strata with value/size (0 (8385), 1 (24768), 2 (16384))
  marker: 1 strata with value/size (1 (385))
  Face Sets: 1 strata with value/size (1 (381))
Defined by transform from:
DM_0x84000002_1 in 2 dimensions:
  Min/Max of 0-cells per rank: 2145/2244
  Min/Max of 1-cells per rank: 6240/6336
  Min/Max of 2-cells per rank: 4096/4096
Labels:
  celltype: 3 strata with value/size (1 (6240), 3 (4096), 0 (2145))
  depth: 3 strata with value/size (0 (2145), 1 (6240), 2 (4096))
  marker: 1 strata with value/size (1 (193))
  Face Sets: 1 strata with value/size (1 (189))
Defined by transform from:
DM_0x84000002_2 in 2 dimensions:
  Min/Max of 0-cells per rank: 561/612
  Min/Max of 1-cells per rank: 1584/1632
  Min/Max of 2-cells per rank: 1024/1024
Labels:
  celltype: 3 strata with value/size (1 (1584), 3 (1024), 0 (561))
  depth: 3 strata with value/size (0 (561), 1 (1584), 2 (1024))
  marker: 1 strata with value/size (1 (97))
  Face Sets: 1 strata with value/size (1 (93))
Defined by transform from:
DM_0x84000002_3 in 2 dimensions:
  Min/Max of 0-cells per rank: 153/180
  Min/Max of 1-cells per rank: 408/432
  Min/Max of 2-cells per rank: 256/256
Labels:
  celltype: 3 strata with value/size (1 (408), 3 (256), 0 (153))
  depth: 3 strata with value/size (0 (153), 1 (408), 2 (256))
  marker: 1 strata with value/size (1 (49))
  Face Sets: 1 strata with value/size (1 (45))
Defined by transform from:
DM_0x84000002_4 in 2 dimensions:
  Min/Max of 0-cells per rank: 45/60
  Min/Max of 1-cells per rank: 108/120
  Min/Max of 2-cells per rank: 64/64
Labels:
  celltype: 3 strata with value/size (1 (108), 3 (64), 0 (45))
  depth: 3 strata with value/size (0 (45), 1 (108), 2 (64))
  marker: 1 strata with value/size (1 (25))
  Face Sets: 1 strata with value/size (1 (21))
Defined by transform from:
DM_0x84000002_5 in 2 dimensions:
  Min/Max of 0-cells per rank: 15/24
  Min/Max of 1-cells per rank: 30/36
  Min/Max of 2-cells per rank: 16/16
Labels:
  celltype: 3 strata with value/size (1 (30), 3 (16), 0 (15))
  depth: 3 strata with value/size (0 (15), 1 (30), 2 (16))
  marker: 1 strata with value/size (1 (13))
  Face Sets: 1 strata with value/size (1 (9))
Defined by transform from:
DM_0x84000002_6 in 2 dimensions:
  Min/Max of 0-cells per rank: 6/12
  Min/Max of 1-cells per rank: 9/12
  Min/Max of 2-cells per rank: 4/4
Labels:
  depth: 3 strata with value/size (0 (6), 1 (9), 2 (4))
  celltype: 3 strata with value/size (0 (6), 1 (9), 3 (4))
  marker: 1 strata with value/size (1 (7))
  Face Sets: 1 strata with value/size (1 (3))
0 TS dt 0.001 time 0.
    MHD 0) time = 0, Eergy= 2.3259668003585e+00 (plot ID 0)
    0 SNES Function norm 5.415286407365e-03
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 245100.0 ON crusher002 CANCELLED AT 2023-01-08T15:32:43 DUE TO TIME LIMIT ***

> Thanks,
>
>    Matt
>
>> On Jan 8, 2023, at 12:21 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>
>> I am running on Crusher, CPU only, 64 cores per node, with Plex/PetscFE.
>> In going up to 64 nodes, something really catastrophic is happening.
>> I understand I am not using the machine the way it was intended, but I
>> just want to see if there are any options that I could try for a quick
>> fix/help.
>>
>> In a debug build I get a stack trace on many, but not all, of the 4K
>> processes.
>> Alas, I am not sure why this job was terminated, but every process that
>> I checked that had an "ERROR" had this stack:
>>
>> 11:57 main *+= crusher:/gpfs/alpine/csc314/scratch/adams/mg-m3dc1/src/data$ grep ERROR slurm-245063.out |g 3160
>> [3160]PETSC ERROR: ------------------------------------------------------------------------
>> [3160]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
>> [3160]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>> [3160]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and https://petsc.org/release/faq/
>> [3160]PETSC ERROR: --------------------- Stack Frames ------------------------------------
>> [3160]PETSC ERROR: The line numbers in the error traceback are not always exact.
>> [3160]PETSC ERROR: #1 MPI function
>> [3160]PETSC ERROR: #2 PetscCommDuplicate() at /gpfs/alpine/csc314/scratch/adams/petsc/src/sys/objects/tagm.c:248
>> [3160]PETSC ERROR: #3 PetscHeaderCreate_Private() at /gpfs/alpine/csc314/scratch/adams/petsc/src/sys/objects/inherit.c:56
>> [3160]PETSC ERROR: #4 PetscSFCreate() at /gpfs/alpine/csc314/scratch/adams/petsc/src/vec/is/sf/interface/sf.c:65
>> [3160]PETSC ERROR: #5 DMLabelGather() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/label/dmlabel.c:1932
>> [3160]PETSC ERROR: #6 DMPlexLabelComplete_Internal() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/plexsubmesh.c:177
>> [3160]PETSC ERROR: #7 DMPlexLabelComplete() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/impls/plex/plexsubmesh.c:227
>> [3160]PETSC ERROR: #8 DMCompleteBCLabels_Internal() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:5301
>> [3160]PETSC ERROR: #9 DMCopyDS() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:6117
>> [3160]PETSC ERROR: #10 DMCopyDisc() at /gpfs/alpine/csc314/scratch/adams/petsc/src/dm/interface/dm.c:6143
>> [3160]PETSC ERROR: #11 SetupDiscretization() at /gpfs/alpine/csc314/scratch/adams/mg-m3dc1/src/mhd_2field.c:755
>>
>> Maybe the MPI is just getting overwhelmed.
>>
>> And I was able to get one run to work (one TS with beuler), but the
>> solver performance was horrendous, and I see this (attached):
>>
>> Time (sec):          1.601e+02    1.001  1.600e+02
>> VecMDot      111712 1.0 5.1684e+01 1.4 2.32e+07 12.8 0.0e+00 0.0e+00 1.1e+05 30  4  0  0 23  30  4  0  0 23   499
>> VecNorm      163478 1.0 6.6660e+01 1.2 1.51e+07 21.5 0.0e+00 0.0e+00 1.6e+05 39  2  0  0 34  39  2  0  0 34   139
>> VecNormalize 154599 1.0 6.3942e+01 1.2 2.19e+07 23.3 0.0e+00 0.0e+00 1.5e+05 38  2  0  0 32  38  2  0  0 32   189
>> etc.
>> KSPSolve          3 1.0 1.1553e+02 1.0 1.34e+09 47.1 2.8e+09 6.0e+01 2.8e+05 72 95 45 72 58  72 95 45 72 58  4772
>>
>> Any ideas would be welcome.
>> Thanks,
>> Mark
>> <cushersolve.txt>

> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/