  Junchao, Mark,

  Some of the logging information is nonsensical: MatMult says all the flops are done on the GPU (last column) but the GPU flop rate is zero.

  It looks like MatMult_SeqAIJKokkos() is missing PetscLogGpuTimeBegin()/PetscLogGpuTimeEnd(); in fact, all the operations in aijkok.kokkos.cxx seem to be missing them. This would explain the crazy 0 GPU flop rate. Can this be fixed ASAP? The pattern I mean is sketched below.
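  For concreteness, something like the following, as a minimal sketch rather than the actual aijkok.kokkos.cxx code (the function name MatMultKernelSketch and its nz argument are illustrative only): wrap each device kernel launch in PetscLogGpuTimeBegin()/PetscLogGpuTimeEnd() so that -log_view has a nonzero GPU time to divide the GPU flops by.

    #include <petscmat.h>

    /* Sketch of the logging pattern missing from the Kokkos Mat operations:
       time the kernel on the GPU and log its flops as GPU flops. */
    static PetscErrorCode MatMultKernelSketch(Mat A, Vec x, Vec y, PetscLogDouble nz)
    {
      PetscErrorCode ierr;

      PetscFunctionBegin;
      ierr = PetscLogGpuTimeBegin();CHKERRQ(ierr);   /* start the GPU timer */
      /* ... launch the Kokkos SpMV kernel on A, x, y here ... */
      ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);     /* stop the GPU timer */
      ierr = PetscLogGpuFlops(2.0*nz);CHKERRQ(ierr); /* 2 flops per stored nonzero */
      PetscFunctionReturn(0);
    }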
  Regarding the VecOps, it sure looks like the kernel launches are killing performance. But in particular, look at the VecTDot and VecNorm CPU flop rates compared to the GPU rates: they are much lower, which tells me the MPI_Allreduce is likely also hurting performance in there a great deal. It would be good to see a single-MPI-rank job to compare against, to see the performance without the MPI overhead (a possible command line is sketched in the P.S. below).

> On Jan 21, 2022, at 6:41 PM, Mark Adams <mfad...@lbl.gov> wrote:
>
> I am looking at the performance of a CG/Jacobi solve on a 3D Q2 Laplacian (ex13) on one Crusher node (8 GPUs on 4 GPU sockets, MI250X or is it MI200?).
> This is with a 16M equation problem. GPU-aware MPI and non-GPU-aware MPI are similar (mat-vec is a little faster without it, the total is about the same; call it noise).
>
> I found that MatMult was about 3x faster using 8 cores/GPU, that is, all 64 cores on the node, than when using 1 core/GPU, with the same size problem of course.
> I was thinking MatMult should be faster with just one MPI process. Oh well, worry about that later.
>
> The bigger problem, and I have observed this to some extent with the Landau TS/SNES/GPU-solver on the V/A100s, is that the vector operations are expensive or crazy expensive.
> You can see from the attached log and the times here that the solve is dominated by the not-mat-vec operations:
>
> ------------------------------------------------------------------------------------------------------------------------
> Event         Count     Time (sec)    Flop                           --- Global ---   --- Stage ----    Total     GPU    - CpuToGpu -  - GpuToCpu -  GPU
>                 Max Ratio  Max  Ratio  Max Ratio  Mess  AvgLen Reduct %T %F %M %L %R  %T  %F  %M  %L %R  Mflop/s Mflop/s Count  Size  Count  Size    %F
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$ grep "MatMult 400" jac_out_00*5_8_gpuawaremp*
> MatMult         400 1.0 1.2507e+00 1.3 1.34e+10 1.1 3.7e+05 1.6e+04 0.0e+00  1 55 62 54  0  27  91 100 100  0   668874       0     0 0.00e+00     0 0.00e+00  100
> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$ grep "KSPSolve 2" jac_out_001*_5_8_gpuawaremp*
> KSPSolve          2 1.0 4.4173e+00 1.0 1.48e+10 1.1 3.7e+05 1.6e+04 1.2e+03  4 60 62 54 61 100 100 100 100 100  208923 1094405     0 0.00e+00     0 0.00e+00  100
>
> Notes about the flop counters here:
> * MatMult flops are not logged as GPU flops, but something is logged nonetheless.
> * The GPU flop rate is 5x the total flop rate in KSPSolve :\
> * I think these nodes have an FP64 peak flop rate of 200 Tflops, so we are at < 1% of peak.
>
> Anyway, not sure how to proceed, but I thought I would share.
> Maybe ask the Kokkos guys if they have looked at Crusher.
>
> Mark
>
> <jac_out_001_kokkos_Crusher_5_8_gpuawarempi.txt>
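P.S. By a single-rank job I mean something along these lines (a sketch only: the ex13/DM options are not spelled out in this thread, so substitute the ones from the 8-rank run):

    # same problem size as before, but one rank, to isolate the MPI/Allreduce overhead
    mpiexec -n 1 ./ex13 <same ex13/DM options as the 8-rank run> \
        -dm_vec_type kokkos -dm_mat_type aijkokkos \
        -ksp_type cg -pc_type jacobi -log_view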