I should be able to add this profiling now.
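Roughly, the pattern would be something like this (a minimal sketch, not the actual aijkok.kokkos.cxx code; the function name, the nnz argument, and the placeholder kernel launch are illustrative only):

```c
#include <petscsys.h>
#include <petsclog.h>

/* Sketch of the logging pattern being asked for: bracket the device kernel
   with PetscLogGpuTimeBegin()/End() and count its work with PetscLogGpuFlops()
   so that -log_view can report a nonzero GPU flop rate. */
static PetscErrorCode MatMultKokkos_Sketch(PetscInt nnz)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  ierr = PetscLogGpuTimeBegin();CHKERRQ(ierr);
  /* ... launch the Kokkos SpMV kernel here (placeholder) ... */
  ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);
  ierr = PetscLogGpuFlops(2.0*(PetscLogDouble)nnz);CHKERRQ(ierr); /* ~2 flops per stored nonzero */
  PetscFunctionReturn(0);
}

int main(int argc, char **argv)
{
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
  ierr = MatMultKokkos_Sketch(1000000);CHKERRQ(ierr); /* illustrative call */
  ierr = PetscFinalize();
  return ierr;
}
```

With the timer and flop count in place, -log_view should be able to attribute the MatMult time to the GPU and report a nonzero GPU Mflop/s instead of 0.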
On Fri, Jan 21, 2022 at 10:48 PM Junchao Zhang <junchao.zh...@gmail.com> wrote:

> On Fri, Jan 21, 2022 at 8:08 PM Barry Smith <bsm...@petsc.dev> wrote:
>
>>   Junchao, Mark,
>>
>>   Some of the logging information is non-sensible: MatMult says all flops
>> are done on the GPU (last column), but the GPU flop rate is zero.
>>
>>   It looks like MatMult_SeqAIJKokkos() is missing
>> PetscLogGpuTimeBegin()/End(); in fact, all the operations in
>> aijkok.kokkos.cxx seem to be missing it. This might explain the crazy 0
>> GPU flop rate. Can this be fixed ASAP?
>
> I will add this profiling temporarily. I may use Kokkos' own profiling
> APIs later.
>
>>   Regarding VecOps, it sure looks like the kernel launches are killing
>> performance.
>>
>>   But in particular, look at the VecTDot and VecNorm CPU flop rates
>> compared to the GPU: they are much lower, which tells me the MPI_Allreduce
>> is likely also hurting performance in there a great deal. It would be good
>> to see a single-MPI-rank job to compare, to see the performance without
>> the MPI overhead.
>>
>> On Jan 21, 2022, at 6:41 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>
>> I am looking at the performance of a CG/Jacobi solve on a 3D Q2 Laplacian
>> (ex13) on one Crusher node (8 GPUs on 4 GPU sockets, MI250X or is it
>> MI200?). This is with a 16M equation problem. GPU-aware MPI and
>> non-GPU-aware MPI are similar (mat-vec is a little faster without it; the
>> total is about the same, call it noise).
>>
>> I found that MatMult was about 3x faster using 8 cores/GPU, that is, all
>> 64 cores on the node, than when using 1 core/GPU, with the same size
>> problem of course. I was thinking MatMult should be faster with just one
>> MPI process. Oh well, worry about that later.
>>
>> The bigger problem, and I have observed this to some extent with the
>> Landau TS/SNES/GPU-solver on the V/A100s, is that the vector operations
>> are expensive or crazy expensive. You can see in the attached output, and
>> in the times here, that the solve is dominated by not-mat-vec:
>>
>> ------------------------------------------------------------------------------------------------------------------------
>> Event                Count      Time (sec)     Flop                            --- Global ---  --- Stage ----   Total     GPU    - CpuToGpu -   - GpuToCpu -  GPU
>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s  Mflop/s  Count   Size  Count   Size  %F
>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$ grep "MatMult 400" jac_out_00*5_8_gpuawaremp*
>> MatMult              400 1.0 1.2507e+00 1.3 1.34e+10 1.1 3.7e+05 1.6e+04 0.0e+00  1 55 62 54  0  27 91 100 100  0  668874       0      0 0.00e+00    0 0.00e+00 100
>> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$ grep "KSPSolve 2" jac_out_001*_5_8_gpuawaremp*
>> KSPSolve               2 1.0 4.4173e+00 1.0 1.48e+10 1.1 3.7e+05 1.6e+04 1.2e+03  4 60 62 54 61 100 100 100 100 100  208923 1094405      0 0.00e+00    0 0.00e+00 100
>>
>> Notes about the flop counters here:
>> * MatMult flops are not logged as GPU flops, but something is logged nonetheless.
>> * The GPU flop rate is 5x the total flop rate in KSPSolve :\
>> * I think these nodes have an FP64 peak flop rate of 200 Tflops, so we are at < 1%.
>>
>> Anyway, not sure how to proceed, but I thought I would share.
>> Maybe ask the Kokkos guys if they have looked at Crusher.
>>
>> Mark
>>
>> <jac_out_001_kokkos_Crusher_5_8_gpuawarempi.txt>
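For reference, since Junchao mentions possibly switching to Kokkos' own profiling hooks later, this is roughly what a named Kokkos profiling region looks like (a minimal sketch; the region and kernel names and the stand-in kernel are placeholders, not PETSc's SpMV):

```cpp
#include <Kokkos_Core.hpp>

int main(int argc, char **argv)
{
  Kokkos::initialize(argc, argv);
  {
    const int n = 1 << 20;
    Kokkos::View<double *> x("x", n), y("y", n);
    Kokkos::deep_copy(x, 1.0);

    // Bracket the kernel with a named profiling region so an attached
    // Kokkos Tools library can attribute the time to it by name.
    Kokkos::Profiling::pushRegion("MatMult-like kernel (placeholder)");
    Kokkos::parallel_for(
        "spmv_placeholder", n,
        KOKKOS_LAMBDA(const int i) { y(i) += 2.0 * x(i); });
    Kokkos::fence();
    Kokkos::Profiling::popRegion();
  }
  Kokkos::finalize();
  return 0;
}
```

When a Kokkos Tools profiling library is attached at run time, the region shows up by name in that tool's timing output.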