  Junchao, Mark,

  Some of the logging information is nonsensical: MatMult says all the flops are done on the GPU (last column) but the GPU flop rate is zero.

  It looks like MatMult_SeqAIJKokkos() is missing PetscLogGpuTimeBegin()/PetscLogGpuTimeEnd(); in fact, all the operations in aijkok.kokkos.cxx seem to be missing them. This would explain the crazy 0 GPU flop rate. Can this be fixed ASAP? The pattern I mean is sketched below.
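  For concreteness, something like the following, as a minimal sketch rather than the actual aijkok.kokkos.cxx code (the function name MatMultKernelSketch and its nz argument are illustrative only): wrap each device kernel launch in PetscLogGpuTimeBegin()/PetscLogGpuTimeEnd() so that -log_view has a nonzero GPU time to divide the GPU flops by.

    #include <petscmat.h>

    /* Sketch of the logging pattern missing from the Kokkos Mat operations:
       time the kernel on the GPU and log its flops as GPU flops. */
    static PetscErrorCode MatMultKernelSketch(Mat A, Vec x, Vec y, PetscLogDouble nz)
    {
      PetscErrorCode ierr;

      PetscFunctionBegin;
      ierr = PetscLogGpuTimeBegin();CHKERRQ(ierr);   /* start the GPU timer */
      /* ... launch the Kokkos SpMV kernel on A, x, y here ... */
      ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);     /* stop the GPU timer */
      ierr = PetscLogGpuFlops(2.0*nz);CHKERRQ(ierr); /* 2 flops per stored nonzero */
      PetscFunctionReturn(0);
    }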
  Regarding the VecOps, it sure looks like the kernel launches are killing performance. But in particular, look at the VecTDot and VecNorm CPU flop rates compared to the GPU rates: they are much lower, which tells me the MPI_Allreduce is likely also hurting performance in there a great deal. It would be good to see a single-MPI-rank job to compare against, to see the performance without the MPI overhead (a possible command line is sketched in the P.S. below).

> On Jan 21, 2022, at 6:41 PM, Mark Adams <mfad...@lbl.gov> wrote:
>
> I am looking at the performance of a CG/Jacobi solve on a 3D Q2 Laplacian (ex13) on one Crusher node (8 GPUs on 4 GPU sockets, MI250X or is it MI200?).
> This is with a 16M equation problem. GPU-aware MPI and non-GPU-aware MPI are similar (mat-vec is a little faster without it, the total is about the same; call it noise).
>
> I found that MatMult was about 3x faster using 8 cores/GPU, that is, all 64 cores on the node, than when using 1 core/GPU, with the same size problem of course.
> I was thinking MatMult should be faster with just one MPI process. Oh well, worry about that later.
>
> The bigger problem, and I have observed this to some extent with the Landau TS/SNES/GPU-solver on the V/A100s, is that the vector operations are expensive or crazy expensive.
> You can see from the attached log and the times here that the solve is dominated by the not-mat-vec operations:
>
> ------------------------------------------------------------------------------------------------------------------------
> Event         Count     Time (sec)    Flop                           --- Global ---   --- Stage ----    Total     GPU    - CpuToGpu -  - GpuToCpu -  GPU
>                 Max Ratio  Max  Ratio  Max Ratio  Mess  AvgLen Reduct %T %F %M %L %R  %T  %F  %M  %L %R  Mflop/s Mflop/s Count  Size  Count  Size    %F
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$ grep "MatMult 400" jac_out_00*5_8_gpuawaremp*
> MatMult         400 1.0 1.2507e+00 1.3 1.34e+10 1.1 3.7e+05 1.6e+04 0.0e+00  1 55 62 54  0  27  91 100 100  0   668874       0     0 0.00e+00     0 0.00e+00  100
> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$ grep "KSPSolve 2" jac_out_001*_5_8_gpuawaremp*
> KSPSolve          2 1.0 4.4173e+00 1.0 1.48e+10 1.1 3.7e+05 1.6e+04 1.2e+03  4 60 62 54 61 100 100 100 100 100  208923 1094405     0 0.00e+00     0 0.00e+00  100
>
> Notes about the flop counters here:
> * MatMult flops are not logged as GPU flops, but something is logged nonetheless.
> * The GPU flop rate is 5x the total flop rate in KSPSolve :\
> * I think these nodes have an FP64 peak flop rate of 200 Tflops, so we are at < 1% of peak.
>
> Anyway, not sure how to proceed, but I thought I would share.
> Maybe ask the Kokkos guys if they have looked at Crusher.
>
> Mark
>
> <jac_out_001_kokkos_Crusher_5_8_gpuawarempi.txt>
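P.S. By a single-rank job I mean something along these lines (a sketch only: the ex13/DM options are not spelled out in this thread, so substitute the ones from the 8-rank run):

    # same problem size as before, but one rank, to isolate the MPI/Allreduce overhead
    mpiexec -n 1 ./ex13 <same ex13/DM options as the 8-rank run> \
        -dm_vec_type kokkos -dm_mat_type aijkokkos \
        -ksp_type cg -pc_type jacobi -log_view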