I should be able to add this profiling now.
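Roughly, the pattern would be something like this (a minimal sketch, not the actual aijkok.kokkos.cxx code; the function name, the nnz argument, and the placeholder kernel launch are illustrative only):

```c
#include <petscsys.h>
#include <petsclog.h>

/* Sketch of the logging pattern being asked for: bracket the device kernel
   with PetscLogGpuTimeBegin()/End() and count its work with PetscLogGpuFlops()
   so that -log_view can report a nonzero GPU flop rate. */
static PetscErrorCode MatMultKokkos_Sketch(PetscInt nnz)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = PetscLogGpuTimeBegin();CHKERRQ(ierr);
  /* ... launch the Kokkos SpMV kernel here (placeholder) ... */
  ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);
  ierr = PetscLogGpuFlops(2.0*(PetscLogDouble)nnz);CHKERRQ(ierr); /* ~2 flops per stored nonzero */
  PetscFunctionReturn(0);
}

int main(int argc, char **argv)
{
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
  ierr = MatMultKokkos_Sketch(1000000);CHKERRQ(ierr); /* illustrative call */
  ierr = PetscFinalize();
  return ierr;
}
```

With the timer and flop count in place, -log_view should be able to attribute the MatMult time to the GPU and report a nonzero GPU Mflop/s instead of 0.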
On Fri, Jan 21, 2022 at 10:48 PM Junchao Zhang <junchao.zh...@gmail.com> wrote:

> On Fri, Jan 21, 2022 at 8:08 PM Barry Smith <bsm...@petsc.dev> wrote:
>
>>   Junchao, Mark,
>>
>>   Some of the logging information is non-sensible: MatMult says all flops
>> are done on the GPU (last column), but the GPU flop rate is zero.
>>
>>   It looks like MatMult_SeqAIJKokkos() is missing
>> PetscLogGpuTimeBegin()/End(); in fact, all the operations in
>> aijkok.kokkos.cxx seem to be missing it. This might explain the crazy 0
>> GPU flop rate. Can this be fixed ASAP?
>
> I will add this profiling temporarily. I may use Kokkos' own profiling
> APIs later.
>
>>   Regarding VecOps, it sure looks like the kernel launches are killing
>> performance.
>>
>>   But in particular, look at the VecTDot and VecNorm CPU flop rates
>> compared to the GPU: they are much lower, which tells me the MPI_Allreduce
>> is likely also hurting performance in there a great deal. It would be good
>> to see a single-MPI-rank job to compare, to see the performance without
>> the MPI overhead.
>>
>> On Jan 21, 2022, at 6:41 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>
>> I am looking at the performance of a CG/Jacobi solve on a 3D Q2 Laplacian
>> (ex13) on one Crusher node (8 GPUs on 4 GPU sockets, MI250X or is it
>> MI200?). This is with a 16M equation problem. GPU-aware MPI and
>> non-GPU-aware MPI are similar (mat-vec is a little faster without it; the
>> total is about the same, call it noise).
>>
>> I found that MatMult was about 3x faster using 8 cores/GPU, that is, all
>> 64 cores on the node, than when using 1 core/GPU, with the same size
>> problem of course. I was thinking MatMult should be faster with just one
>> MPI process. Oh well, worry about that later.
>>
>> The bigger problem, and I have observed this to some extent with the
>> Landau TS/SNES/GPU-solver on the V/A100s, is that the vector operations
>> are expensive or crazy expensive. You can see in the attached output, and
>> in the times here, that the solve is dominated by not-mat-vec:
>>
>> ------------------------------------------------------------------------------------------------------------------------
>> Event                Count      Time (sec)     Flop                            --- Global ---  --- Stage ----   Total     GPU    - CpuToGpu -   - GpuToCpu -  GPU
>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s  Mflop/s  Count   Size  Count   Size  %F
>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$ grep "MatMult 400" jac_out_00*5_8_gpuawaremp*
>> MatMult              400 1.0 1.2507e+00 1.3 1.34e+10 1.1 3.7e+05 1.6e+04 0.0e+00  1 55 62 54  0  27 91 100 100  0  668874       0      0 0.00e+00    0 0.00e+00 100
>> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$ grep "KSPSolve 2" jac_out_001*_5_8_gpuawaremp*
>> KSPSolve               2 1.0 4.4173e+00 1.0 1.48e+10 1.1 3.7e+05 1.6e+04 1.2e+03  4 60 62 54 61 100 100 100 100 100  208923 1094405      0 0.00e+00    0 0.00e+00 100
>>
>> Notes about the flop counters here:
>> * MatMult flops are not logged as GPU flops, but something is logged nonetheless.
>> * The GPU flop rate is 5x the total flop rate in KSPSolve :\
>> * I think these nodes have an FP64 peak flop rate of 200 Tflops, so we are at < 1%.
>>
>> Anyway, not sure how to proceed, but I thought I would share.
>> Maybe ask the Kokkos guys if they have looked at Crusher.
>>
>> Mark
>>
>> <jac_out_001_kokkos_Crusher_5_8_gpuawarempi.txt>
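For reference, since Junchao mentions possibly switching to Kokkos' own profiling hooks later, this is roughly what a named Kokkos profiling region looks like (a minimal sketch; the region and kernel names and the stand-in kernel are placeholders, not PETSc's SpMV):

```cpp
#include <Kokkos_Core.hpp>

int main(int argc, char **argv)
{
  Kokkos::initialize(argc, argv);
  {
    const int n = 1 << 20;
    Kokkos::View<double *> x("x", n), y("y", n);
    Kokkos::deep_copy(x, 1.0);

    // Bracket the kernel with a named profiling region so an attached
    // Kokkos Tools library can attribute the time to it by name.
    Kokkos::Profiling::pushRegion("MatMult-like kernel (placeholder)");
    Kokkos::parallel_for(
        "spmv_placeholder", n,
        KOKKOS_LAMBDA(const int i) { y(i) += 2.0 * x(i); });
    Kokkos::fence();
    Kokkos::Profiling::popRegion();
  }
  Kokkos::finalize();
  return 0;
}
```

When a Kokkos Tools profiling library is attached at run time, the region shows up by name in that tool's timing output.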