On Sat, Jan 22, 2022 at 10:04 AM Mark Adams <mfad...@lbl.gov> wrote:

> Logging GPU flops should be inside of PetscLogGpuTimeBegin()/End(), right?
>

No, PetscLogGpuTime() does not know the flops of the caller.
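In case it is useful, here is a minimal sketch of what I mean (the wrapper name and flop count are made up): PetscLogGpuTimeBegin()/End() only accumulate device time for the enclosing logged event, so the caller reports its own flops explicitly with PetscLogGpuFlops(), and as far as I understand it does not matter whether that call sits inside or outside the timed region.

    #include <petscsys.h>

    /* Hypothetical kernel wrapper; only the PetscLog* calls are the point here. */
    PetscErrorCode MyGpuAXPY(PetscInt n)
    {
      PetscErrorCode ierr;

      PetscFunctionBegin;
      ierr = PetscLogGpuTimeBegin();CHKERRQ(ierr);
      /* ... launch the device kernel here; an axpy-like op does about 2*n flops ... */
      ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);
      /* The timer above only measures elapsed GPU time; the flops are
         reported separately so they show up in the event's GPU Mflop/s column. */
      ierr = PetscLogGpuFlops(2.0*(PetscLogDouble)n);CHKERRQ(ierr);
      PetscFunctionReturn(0);
    }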
>
> On Fri, Jan 21, 2022 at 9:47 PM Barry Smith <bsm...@petsc.dev> wrote:
>
>> Mark,
>>
>>   Fix the logging before you run more. It will help with seeing the true disparity between the MatMult and the vector ops.
>>
>> On Jan 21, 2022, at 9:37 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>
>> Here is one with 2M / GPU. Getting better.
>>
>> On Fri, Jan 21, 2022 at 9:17 PM Barry Smith <bsm...@petsc.dev> wrote:
>>
>>>   Matt is correct, vectors are way too small.
>>>
>>>   BTW: Now would be a good time to run some of the Report 1 benchmarks on Crusher to get a feel for the kernel launch times and performance on VecOps.
>>>
>>>   Also Report 2.
>>>
>>>   Barry
>>>
>>> On Jan 21, 2022, at 7:58 PM, Matthew Knepley <knep...@gmail.com> wrote:
>>>
>>> On Fri, Jan 21, 2022 at 6:41 PM Mark Adams <mfad...@lbl.gov> wrote:
>>>
>>>> I am looking at the performance of a CG/Jacobi solve on a 3D Q2 Laplacian (ex13) on one Crusher node (8 GPUs on 4 GPU sockets; MI250X, or is it MI200?).
>>>> This is with a 16M-equation problem. GPU-aware MPI and non-GPU-aware MPI are similar (mat-vec is a little faster without it; the total is about the same, call it noise).
>>>>
>>>> I found that MatMult was about 3x faster using 8 cores/GPU, that is, all 64 cores on the node, than when using 1 core/GPU, with the same size problem of course.
>>>> I was thinking MatMult should be faster with just one MPI process. Oh well, worry about that later.
>>>>
>>>> The bigger problem, and I have observed this to some extent with the Landau TS/SNES/GPU solver on the V100s/A100s, is that the vector operations are expensive or crazy expensive.
>>>> You can see from the attached output, and from the times here, that the solve is dominated by the non-mat-vec operations:
>>>>
>>>> ------------------------------------------------------------------------------------------------------------------------
>>>> Event                Count      Time (sec)     Flop                             --- Global ---  --- Stage ----   Total     GPU    - CpuToGpu -   - GpuToCpu - GPU
>>>>                        Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R  Mflop/s Mflop/s Count   Size   Count   Size  %F
>>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$ grep "MatMult 400" jac_out_00*5_8_gpuawaremp*
>>>> MatMult              400 1.0 1.2507e+00 1.3 1.34e+10 1.1 3.7e+05 1.6e+04 0.0e+00  1 55 62 54  0  27 91 100 100  0   668874       0      0 0.00e+00    0 0.00e+00 100
>>>> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$ grep "KSPSolve 2" jac_out_001*_5_8_gpuawaremp*
>>>> KSPSolve               2 1.0 4.4173e+00 1.0 1.48e+10 1.1 3.7e+05 1.6e+04 1.2e+03  4 60 62 54 61 100 100 100 100 100  208923 1094405      0 0.00e+00    0 0.00e+00 100
>>>>
>>>> Notes about the flop counters here:
>>>> * MatMult flops are not logged as GPU flops, but something is logged nonetheless.
>>>> * The GPU flop rate is 5x the total flop rate in KSPSolve :\
>>>> * I think these nodes have an FP64 peak flop rate of 200 Tflops, so we are at < 1%.
>>>
>>> This looks complicated, so just a single remark:
>>>
>>> My understanding of the benchmarking of vector ops led by Hannah was that you needed to be much bigger than 16M to hit peak. I need to get the tech report, but on 8 GPUs I would think you would be at 10% of peak or something right off the bat at these sizes.
>>> Barry, is that right?
>>>
>>> Thanks,
>>>
>>>    Matt
>>>
>>>> Anyway, not sure how to proceed, but I thought I would share.
>>>> Maybe ask the Kokkos guys if they have looked at Crusher.
>>>>
>>>> Mark
>>>
>>> --
>>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
>>>    -- Norbert Wiener
>>>
>>> https://www.cse.buffalo.edu/~knepley/
>>>
>>> <jac_out_001_kokkos_Crusher_6_8_gpuawarempi.txt>
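A quick sanity check on that "< 1%" figure, using only the numbers in the quoted log: the KSPSolve GPU rate is 1094405 Mflop/s, about 1.1 Tflop/s, so against the ~200 Tflop/s FP64 node peak Mark assumes that is roughly 1.1 / 200, i.e. about 0.5% of peak.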