> On Jan 22, 2022, at 10:00 PM, Junchao Zhang <junchao.zh...@gmail.com> wrote:
>
> On Sat, Jan 22, 2022 at 5:00 PM Barry Smith <bsm...@petsc.dev> wrote:
>
>    The GPU flop rate (when 100 percent of the flops are on the GPU) should always be higher than the overall flop rate (the previous column). For large problems they should be similar; for small problems the GPU one may be much higher.
>
>    If the CPU one is higher (when 100 percent of the flops are on the GPU), something must be wrong with the logging. I looked at the code for the two cases and didn't see anything obvious.
>
>    Junchao and Jacob,
>
>    I think some of the timing code in the Kokkos interface is wrong.
>
>    * The PetscLogGpuTimeBegin/End should be inside the view access code, not outside it. (The GPU time is an attempt to time just the kernels, not the other processing around the use of the kernels; that other work is captured in the general PetscLogEventBegin/End.)
>
> Good point
>
>    * The use of WaitForKokkos() is confusing and seems inconsistent.
>
> I need to have a look. Until now, I have not paid much attention to Kokkos profiling.

That is what is so great about Mark. He makes us do what we should have done before :-)

>      - For example, it is used in VecTDot_SeqKokkos(), which I would think has a barrier anyway because it puts a scalar result into the update?
>
>      - Plus, PetscLogGpuTimeBegin/End is supposed to already have a suitable mechanism (that Hong added) to ensure the kernel is complete; reading the manual page and looking at Jacob's cupmcontext.hpp it seems to be there, so I don't think WaitForKokkos() is needed in most places (or is Kokkos asynchronous and needs this for correctness?).
>
> But these won't explain the strange result of the overall flop rate being higher than the GPU flop rate.
>
>    Barry
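As a point of reference for the placement being discussed, here is a minimal sketch written against the public PETSc API as a user-level Kokkos kernel, not the actual Vec_Kokkos backend source. The event MY_AXPY is assumed to have been registered with PetscLogEventRegister(), and the sketch assumes VecGetArrayAndMemType()/VecGetArrayReadAndMemType() hand back device pointers for a GPU vector; the only point is where the event timer, the GPU timer, and the flop logging go.

#include <petscvec.h>
#include <Kokkos_Core.hpp>

static PetscLogEvent MY_AXPY; /* assumed registered elsewhere: PetscLogEventRegister("MyAXPY",VEC_CLASSID,&MY_AXPY) */

PetscErrorCode MyAXPYKokkos(Vec y, PetscScalar alpha, Vec x)
{
  PetscErrorCode     ierr;
  PetscInt           n;
  PetscScalar       *ya;
  const PetscScalar *xa;
  using MemSpace = Kokkos::DefaultExecutionSpace::memory_space;

  PetscFunctionBeginUser;
  ierr = PetscLogEventBegin(MY_AXPY,0,0,0,0);CHKERRQ(ierr);     /* event time: the whole routine, host work included */
  ierr = VecGetLocalSize(x,&n);CHKERRQ(ierr);
  ierr = VecGetArrayReadAndMemType(x,&xa,NULL);CHKERRQ(ierr);   /* device pointer for a GPU vector */
  ierr = VecGetArrayAndMemType(y,&ya,NULL);CHKERRQ(ierr);
  {
    Kokkos::View<const PetscScalar*,MemSpace,Kokkos::MemoryTraits<Kokkos::Unmanaged>> xv(xa,n);
    Kokkos::View<PetscScalar*,MemSpace,Kokkos::MemoryTraits<Kokkos::Unmanaged>>       yv(ya,n);
    ierr = PetscLogGpuTimeBegin();CHKERRQ(ierr);                /* GPU time: only the kernel itself */
    Kokkos::parallel_for("my_axpy",n,KOKKOS_LAMBDA(const PetscInt i) { yv(i) += alpha*xv(i); });
    ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);                  /* the GPU timer is expected to synchronize on the kernel itself */
  }
  ierr = VecRestoreArrayAndMemType(y,&ya);CHKERRQ(ierr);
  ierr = VecRestoreArrayReadAndMemType(x,&xa);CHKERRQ(ierr);
  ierr = PetscLogGpuFlops(2.0*n);CHKERRQ(ierr);                 /* flops are logged separately; the GPU timer does not know them */
  ierr = PetscLogEventEnd(MY_AXPY,0,0,0,0);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

With that placement, the event time covers the array access and any host-side work, while the GPU time (and hence the GPU Mflop/s column) covers only the kernel, which is why the GPU rate should normally come out higher.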
>> On Jan 22, 2022, at 11:44 AM, Mark Adams <mfad...@lbl.gov> wrote:
>>
>> I am getting some funny timings and I'm trying to figure them out. I figure the GPU flop rates are a bit higher because the GPU timers are inside of the CPU timers, but some are a lot bigger, or inverted:
>>
>> --- Event Stage 2: KSP Solve only
>>
>> MatMult              400 1.0 1.0094e+01 1.2  1.07e+11 1.0 3.7e+05 6.1e+04 0.0e+00  2 55 62 54  0  68 91 100 100   0  671849   857147      0 0.00e+00    0 0.00e+00 100
>> MatView                2 1.0 4.5257e-03 2.5  0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  0   0  0   0   0   0       0        0      0 0.00e+00    0 0.00e+00   0
>> KSPSolve               2 1.0 1.4591e+01 1.1  1.18e+11 1.0 3.7e+05 6.1e+04 1.2e+03  2 60 62 54 60 100 100 100 100 100  512399   804048      0 0.00e+00    0 0.00e+00 100
>> SFPack               400 1.0 2.4545e-03 1.4  0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0   0       0        0      0 0.00e+00    0 0.00e+00   0
>> SFUnpack             400 1.0 9.4637e-05 1.7  0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0   0       0        0      0 0.00e+00    0 0.00e+00   0
>> VecTDot              802 1.0 3.0577e+00 2.1  3.36e+09 1.0 0.0e+00 0.0e+00 8.0e+02  0  2  0  0 40  13  3   0   0  67   69996   488328      0 0.00e+00    0 0.00e+00 100
>> VecNorm              402 1.0 1.9597e+00 3.4  1.69e+09 1.0 0.0e+00 0.0e+00 4.0e+02  0  1  0  0 20   6  1   0   0  33   54744   571507      0 0.00e+00    0 0.00e+00 100
>> VecCopy                4 1.0 1.7143e-02 28.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0   0       0        0      0 0.00e+00    0 0.00e+00   0
>> VecSet                 4 1.0 3.8051e-03 16.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0   0       0        0      0 0.00e+00    0 0.00e+00   0
>> VecAXPY              800 1.0 8.6160e-01 13.6 3.36e+09 1.0 0.0e+00 0.0e+00 0.0e+00  0  2  0  0  0   6  3   0   0   0  247787   448304      0 0.00e+00    0 0.00e+00 100
>> VecAYPX              398 1.0 1.6831e+00 31.1 1.67e+09 1.0 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   5  1   0   0   0   63107    77030      0 0.00e+00    0 0.00e+00 100
>> VecPointwiseMult     402 1.0 3.8729e-01 9.3  8.43e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   2  1   0   0   0  138502   262413      0 0.00e+00    0 0.00e+00 100
>> VecScatterBegin      400 1.0 1.1947e+00 35.1 0.00e+00 0.0 3.7e+05 6.1e+04 0.0e+00  0  0 62 54  0   5  0 100 100   0       0        0      0 0.00e+00    0 0.00e+00   0
>> VecScatterEnd        400 1.0 6.2969e+00 8.8  0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  10  0   0   0   0       0        0      0 0.00e+00    0 0.00e+00   0
>> PCApply              402 1.0 3.8758e-01 9.3  8.43e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   2  1   0   0   0  138396   262413      0 0.00e+00    0 0.00e+00 100
>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
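Regarding the VecTDot_SeqKokkos()/barrier question above: a standalone Kokkos sketch (not the PETSc source) of that kind of reduction. When the result is delivered into a plain host scalar, Kokkos guarantees the parallel_reduce has completed before it returns, so an extra fence such as WaitForKokkos() should not be needed for correctness there; an explicit fence only becomes an issue when the result goes into a device-side View or the kernel is otherwise launched asynchronously.

#include <Kokkos_Core.hpp>
#include <cstdio>

int main(int argc, char **argv)
{
  Kokkos::initialize(argc, argv);
  {
    const int n = 1 << 20;
    Kokkos::View<double*> x("x", n), y("y", n);
    Kokkos::deep_copy(x, 1.0);
    Kokkos::deep_copy(y, 2.0);

    double dot = 0.0;
    /* Reducing into a plain host scalar: parallel_reduce blocks until the reduction
       is finished, so 'dot' is valid immediately after the call with no extra fence. */
    Kokkos::parallel_reduce("tdot", n, KOKKOS_LAMBDA(const int i, double &sum) { sum += x(i)*y(i); }, dot);
    std::printf("dot = %g (expected %g)\n", dot, 2.0*n);
  }
  Kokkos::finalize();
  return 0;
}

Whether the production code can rely on that, or needs the fence for some other reason, is exactly the question raised above.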
>> On Sat, Jan 22, 2022 at 11:10 AM Junchao Zhang <junchao.zh...@gmail.com> wrote:
>>
>> On Sat, Jan 22, 2022 at 10:04 AM Mark Adams <mfad...@lbl.gov> wrote:
>>
>> Logging GPU flops should be inside of PetscLogGpuTimeBegin()/End(), right?
>>
>> No, PetscLogGpuTime() does not know the flops of the caller.
>>
>> On Fri, Jan 21, 2022 at 9:47 PM Barry Smith <bsm...@petsc.dev> wrote:
>>
>>    Mark,
>>
>>    Fix the logging before you run more. It will help with seeing the true disparity between the MatMult and the vector ops.
>>
>>> On Jan 21, 2022, at 9:37 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>>
>>> Here is one with 2M / GPU. Getting better.
>>>
>>> On Fri, Jan 21, 2022 at 9:17 PM Barry Smith <bsm...@petsc.dev> wrote:
>>>
>>>    Matt is correct, the vectors are way too small.
>>>
>>>    BTW: Now would be a good time to run some of the Report I benchmarks on Crusher to get a feel for the kernel launch times and performance of the VecOps.
>>>
>>>    Also Report 2.
>>>
>>>    Barry
>>>
>>>> On Jan 21, 2022, at 7:58 PM, Matthew Knepley <knep...@gmail.com> wrote:
>>>>
>>>> On Fri, Jan 21, 2022 at 6:41 PM Mark Adams <mfad...@lbl.gov> wrote:
>>>>
>>>> I am looking at the performance of a CG/Jacobi solve on a 3D Q2 Laplacian (ex13) on one Crusher node (8 GPUs on 4 GPU sockets; MI250X, or is it MI200?). This is with a 16M equation problem. GPU-aware MPI and non-GPU-aware MPI are similar (mat-vec is a little faster without it, the total is about the same; call it noise).
>>>>
>>>> I found that MatMult was about 3x faster using 8 cores/GPU, that is, all 64 cores on the node, than when using 1 core/GPU, with the same size problem of course. I was thinking MatMult should be faster with just one MPI process. Oh well, worry about that later.
>>>>
>>>> The bigger problem, and I have observed this to some extent with the Landau TS/SNES/GPU-solver on the V/A100s, is that the vector operations are expensive or crazy expensive. You can see from the attached output, and the times here, that the solve is dominated by the not-mat-vec operations:
>>>>
>>>> ------------------------------------------------------------------------------------------------------------------------
>>>> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total    GPU    - CpuToGpu -   - GpuToCpu - GPU
>>>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
>>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$ grep "MatMult 400" jac_out_00*5_8_gpuawaremp*
>>>> MatMult              400 1.0 1.2507e+00 1.3 1.34e+10 1.1 3.7e+05 1.6e+04 0.0e+00  1 55 62 54  0  27 91 100 100   0  668874        0      0 0.00e+00    0 0.00e+00 100
>>>> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$ grep "KSPSolve 2" jac_out_001*_5_8_gpuawaremp*
>>>> KSPSolve               2 1.0 4.4173e+00 1.0 1.48e+10 1.1 3.7e+05 1.6e+04 1.2e+03  4 60 62 54 61 100 100 100 100 100  208923  1094405      0 0.00e+00    0 0.00e+00 100
>>>>
>>>> Notes about the flop counters here:
>>>> * The MatMult flops are not logged as GPU flops, but something is logged nonetheless.
>>>> * The GPU flop rate is 5x the total flop rate in KSPSolve :\
>>>> * I think these nodes have an FP64 peak flop rate of 200 Tflops, so we are at < 1%.
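On the "5x" note just above: a back-of-the-envelope check, assuming both rate columns are built from the same flop totals so the flops cancel and only the two times differ; this is illustrative arithmetic, not PETSc code.

#include <cstdio>

int main(void)
{
  /* numbers from the KSPSolve row above */
  const double event_time = 4.4173;    /* max KSPSolve time (s)  */
  const double rate_total = 208923.0;  /* overall Mflop/s column */
  const double rate_gpu   = 1094405.0; /* GPU Mflop/s column     */
  /* same flops divided by two different times, so the times scale inversely with the rates */
  const double gpu_time = event_time*rate_total/rate_gpu;

  std::printf("implied GPU-timed kernel time: %.2f s of the %.2f s KSPSolve\n", gpu_time, event_time);
  /* roughly 0.84 s of the 4.42 s solve is spent inside PetscLogGpuTimeBegin/End; the rest is
     MPI waits, kernel-launch latency, and other host-side work, which is why the GPU rate can
     legitimately sit well above the overall rate for a small problem like this one. */
  return 0;
}

That is consistent with the point at the top of the thread that the GPU rate should exceed the overall rate, dramatically so for small problems; the separate puzzle is the cases where the overall rate comes out higher than the GPU rate.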
>>>> This looks complicated, so just a single remark:
>>>>
>>>> My understanding of the benchmarking of vector ops led by Hannah was that you needed to be much bigger than 16M to hit peak. I need to get the tech report, but on 8 GPUs I would think you would be at 10% of peak or something right off the bat at these sizes. Barry, is that right?
>>>>
>>>>   Thanks,
>>>>
>>>>      Matt
>>>>
>>>> Anyway, not sure how to proceed, but I thought I would share. Maybe ask the Kokkos guys if they have looked at Crusher.
>>>>
>>>> Mark
>>>>
>>>> --
>>>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
>>>> -- Norbert Wiener
>>>>
>>>> https://www.cse.buffalo.edu/~knepley/
>>>
>>> <jac_out_001_kokkos_Crusher_6_8_gpuawarempi.txt>
>>
>> <jac_out_001_kokkos_Crusher_5_8_notpl.txt> <jac_out_001_kokkos_Crusher_6_8_notpl.txt>