> On Jan 22, 2022, at 10:00 PM, Junchao Zhang <junchao.zh...@gmail.com> wrote:
>
> On Sat, Jan 22, 2022 at 5:00 PM Barry Smith <bsm...@petsc.dev> wrote:
>
>    The GPU flop rate (when 100 percent of the flops are on the GPU) should always be higher than the overall flop rate (the previous column). For large problems they should be similar; for small problems the GPU one may be much higher.
>
>    If the CPU one is higher (when 100 percent of the flops are on the GPU), something must be wrong with the logging. I looked at the code for the two cases and didn't see anything obvious.
>
>    Junchao and Jacob,
>
>    I think some of the timing code in the Kokkos interface is wrong.
>
>    * The PetscLogGpuTimeBegin/End should be inside the view access code, not outside it. (The GPU time is an attempt to time just the kernels, not the other processing around the use of the kernels; that other work is captured in the general PetscLogEventBegin/End.)
>
> Good point
>
>    * The use of WaitForKokkos() is confusing and seems inconsistent.
>
> I need to have a look. Until now, I have not paid much attention to Kokkos profiling.

That is what is so great about Mark. He makes us do what we should have done before :-)

>      - For example, it is used in VecTDot_SeqKokkos(), which I would think has a barrier anyway because it puts a scalar result into the update?
>
>      - Plus, PetscLogGpuTimeBegin/End is supposed to already have a suitable mechanism (that Hong added) to ensure the kernel is complete; reading the manual page and looking at Jacob's cupmcontext.hpp it seems to be there, so I don't think WaitForKokkos() is needed in most places (or is Kokkos asynchronous and needs this for correctness?).
>
> But these won't explain the strange result of the overall flop rate being higher than the GPU flop rate.
>
>    Barry
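As a point of reference for the placement being discussed, here is a minimal sketch written against the public PETSc API as a user-level Kokkos kernel, not the actual Vec_Kokkos backend source. The event MY_AXPY is assumed to have been registered with PetscLogEventRegister(), and the sketch assumes VecGetArrayAndMemType()/VecGetArrayReadAndMemType() hand back device pointers for a GPU vector; the only point is where the event timer, the GPU timer, and the flop logging go.

#include <petscvec.h>
#include <Kokkos_Core.hpp>

static PetscLogEvent MY_AXPY; /* assumed registered elsewhere: PetscLogEventRegister("MyAXPY",VEC_CLASSID,&MY_AXPY) */

PetscErrorCode MyAXPYKokkos(Vec y, PetscScalar alpha, Vec x)
{
  PetscErrorCode     ierr;
  PetscInt           n;
  PetscScalar       *ya;
  const PetscScalar *xa;
  using MemSpace = Kokkos::DefaultExecutionSpace::memory_space;

  PetscFunctionBeginUser;
  ierr = PetscLogEventBegin(MY_AXPY,0,0,0,0);CHKERRQ(ierr);     /* event time: the whole routine, host work included */
  ierr = VecGetLocalSize(x,&n);CHKERRQ(ierr);
  ierr = VecGetArrayReadAndMemType(x,&xa,NULL);CHKERRQ(ierr);   /* device pointer for a GPU vector */
  ierr = VecGetArrayAndMemType(y,&ya,NULL);CHKERRQ(ierr);
  {
    Kokkos::View<const PetscScalar*,MemSpace,Kokkos::MemoryTraits<Kokkos::Unmanaged>> xv(xa,n);
    Kokkos::View<PetscScalar*,MemSpace,Kokkos::MemoryTraits<Kokkos::Unmanaged>>       yv(ya,n);
    ierr = PetscLogGpuTimeBegin();CHKERRQ(ierr);                /* GPU time: only the kernel itself */
    Kokkos::parallel_for("my_axpy",n,KOKKOS_LAMBDA(const PetscInt i) { yv(i) += alpha*xv(i); });
    ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);                  /* the GPU timer is expected to synchronize on the kernel itself */
  }
  ierr = VecRestoreArrayAndMemType(y,&ya);CHKERRQ(ierr);
  ierr = VecRestoreArrayReadAndMemType(x,&xa);CHKERRQ(ierr);
  ierr = PetscLogGpuFlops(2.0*n);CHKERRQ(ierr);                 /* flops are logged separately; the GPU timer does not know them */
  ierr = PetscLogEventEnd(MY_AXPY,0,0,0,0);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}

With that placement, the event time covers the array access and any host-side work, while the GPU time (and hence the GPU Mflop/s column) covers only the kernel, which is why the GPU rate should normally come out higher.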
>> On Jan 22, 2022, at 11:44 AM, Mark Adams <mfad...@lbl.gov> wrote:
>>
>> I am getting some funny timings and I'm trying to figure them out. I figure the GPU flop rates are a bit higher because the GPU timers are inside of the CPU timers, but some are a lot bigger, or inverted:
>>
>> --- Event Stage 2: KSP Solve only
>>
>> MatMult              400 1.0 1.0094e+01 1.2  1.07e+11 1.0 3.7e+05 6.1e+04 0.0e+00  2 55 62 54  0  68 91 100 100   0  671849   857147      0 0.00e+00    0 0.00e+00 100
>> MatView                2 1.0 4.5257e-03 2.5  0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+00  0  0  0  0  0   0  0   0   0   0       0        0      0 0.00e+00    0 0.00e+00   0
>> KSPSolve               2 1.0 1.4591e+01 1.1  1.18e+11 1.0 3.7e+05 6.1e+04 1.2e+03  2 60 62 54 60 100 100 100 100 100  512399   804048      0 0.00e+00    0 0.00e+00 100
>> SFPack               400 1.0 2.4545e-03 1.4  0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0   0       0        0      0 0.00e+00    0 0.00e+00   0
>> SFUnpack             400 1.0 9.4637e-05 1.7  0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0   0       0        0      0 0.00e+00    0 0.00e+00   0
>> VecTDot              802 1.0 3.0577e+00 2.1  3.36e+09 1.0 0.0e+00 0.0e+00 8.0e+02  0  2  0  0 40  13  3   0   0  67   69996   488328      0 0.00e+00    0 0.00e+00 100
>> VecNorm              402 1.0 1.9597e+00 3.4  1.69e+09 1.0 0.0e+00 0.0e+00 4.0e+02  0  1  0  0 20   6  1   0   0  33   54744   571507      0 0.00e+00    0 0.00e+00 100
>> VecCopy                4 1.0 1.7143e-02 28.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0   0       0        0      0 0.00e+00    0 0.00e+00   0
>> VecSet                 4 1.0 3.8051e-03 16.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0   0   0   0       0        0      0 0.00e+00    0 0.00e+00   0
>> VecAXPY              800 1.0 8.6160e-01 13.6 3.36e+09 1.0 0.0e+00 0.0e+00 0.0e+00  0  2  0  0  0   6  3   0   0   0  247787   448304      0 0.00e+00    0 0.00e+00 100
>> VecAYPX              398 1.0 1.6831e+00 31.1 1.67e+09 1.0 0.0e+00 0.0e+00 0.0e+00  0  1  0  0  0   5  1   0   0   0   63107    77030      0 0.00e+00    0 0.00e+00 100
>> VecPointwiseMult     402 1.0 3.8729e-01 9.3  8.43e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   2  1   0   0   0  138502   262413      0 0.00e+00    0 0.00e+00 100
>> VecScatterBegin      400 1.0 1.1947e+00 35.1 0.00e+00 0.0 3.7e+05 6.1e+04 0.0e+00  0  0 62 54  0   5  0 100 100   0       0        0      0 0.00e+00    0 0.00e+00   0
>> VecScatterEnd        400 1.0 6.2969e+00 8.8  0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  10  0   0   0   0       0        0      0 0.00e+00    0 0.00e+00   0
>> PCApply              402 1.0 3.8758e-01 9.3  8.43e+08 1.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   2  1   0   0   0  138396   262413      0 0.00e+00    0 0.00e+00 100
>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
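Regarding the VecTDot_SeqKokkos()/barrier question above: a standalone Kokkos sketch (not the PETSc source) of that kind of reduction. When the result is delivered into a plain host scalar, Kokkos guarantees the parallel_reduce has completed before it returns, so an extra fence such as WaitForKokkos() should not be needed for correctness there; an explicit fence only becomes an issue when the result goes into a device-side View or the kernel is otherwise launched asynchronously.

#include <Kokkos_Core.hpp>
#include <cstdio>

int main(int argc, char **argv)
{
  Kokkos::initialize(argc, argv);
  {
    const int n = 1 << 20;
    Kokkos::View<double*> x("x", n), y("y", n);
    Kokkos::deep_copy(x, 1.0);
    Kokkos::deep_copy(y, 2.0);

    double dot = 0.0;
    /* Reducing into a plain host scalar: parallel_reduce blocks until the reduction
       is finished, so 'dot' is valid immediately after the call with no extra fence. */
    Kokkos::parallel_reduce("tdot", n, KOKKOS_LAMBDA(const int i, double &sum) { sum += x(i)*y(i); }, dot);
    std::printf("dot = %g (expected %g)\n", dot, 2.0*n);
  }
  Kokkos::finalize();
  return 0;
}

Whether the production code can rely on that, or needs the fence for some other reason, is exactly the question raised above.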
>> On Sat, Jan 22, 2022 at 11:10 AM Junchao Zhang <junchao.zh...@gmail.com> wrote:
>>
>> On Sat, Jan 22, 2022 at 10:04 AM Mark Adams <mfad...@lbl.gov> wrote:
>>
>> Logging GPU flops should be inside of PetscLogGpuTimeBegin()/End(), right?
>>
>> No, PetscLogGpuTime() does not know the flops of the caller.
>>
>> On Fri, Jan 21, 2022 at 9:47 PM Barry Smith <bsm...@petsc.dev> wrote:
>>
>>    Mark,
>>
>>    Fix the logging before you run more. It will help with seeing the true disparity between the MatMult and the vector ops.
>>
>>> On Jan 21, 2022, at 9:37 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>>
>>> Here is one with 2M / GPU. Getting better.
>>>
>>> On Fri, Jan 21, 2022 at 9:17 PM Barry Smith <bsm...@petsc.dev> wrote:
>>>
>>>    Matt is correct, the vectors are way too small.
>>>
>>>    BTW: Now would be a good time to run some of the Report I benchmarks on Crusher to get a feel for the kernel launch times and performance of the VecOps.
>>>
>>>    Also Report 2.
>>>
>>>    Barry
>>>
>>>> On Jan 21, 2022, at 7:58 PM, Matthew Knepley <knep...@gmail.com> wrote:
>>>>
>>>> On Fri, Jan 21, 2022 at 6:41 PM Mark Adams <mfad...@lbl.gov> wrote:
>>>>
>>>> I am looking at the performance of a CG/Jacobi solve on a 3D Q2 Laplacian (ex13) on one Crusher node (8 GPUs on 4 GPU sockets; MI250X, or is it MI200?). This is with a 16M equation problem. GPU-aware MPI and non-GPU-aware MPI are similar (mat-vec is a little faster without it, the total is about the same; call it noise).
>>>>
>>>> I found that MatMult was about 3x faster using 8 cores/GPU, that is, all 64 cores on the node, than when using 1 core/GPU, with the same size problem of course. I was thinking MatMult should be faster with just one MPI process. Oh well, worry about that later.
>>>>
>>>> The bigger problem, and I have observed this to some extent with the Landau TS/SNES/GPU-solver on the V/A100s, is that the vector operations are expensive or crazy expensive. You can see from the attached output, and the times here, that the solve is dominated by the not-mat-vec operations:
>>>>
>>>> ------------------------------------------------------------------------------------------------------------------------
>>>> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total    GPU    - CpuToGpu -   - GpuToCpu - GPU
>>>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
>>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$ grep "MatMult 400" jac_out_00*5_8_gpuawaremp*
>>>> MatMult              400 1.0 1.2507e+00 1.3 1.34e+10 1.1 3.7e+05 1.6e+04 0.0e+00  1 55 62 54  0  27 91 100 100   0  668874        0      0 0.00e+00    0 0.00e+00 100
>>>> 17:15 main= /gpfs/alpine/csc314/scratch/adams/petsc/src/snes/tests/data$ grep "KSPSolve 2" jac_out_001*_5_8_gpuawaremp*
>>>> KSPSolve               2 1.0 4.4173e+00 1.0 1.48e+10 1.1 3.7e+05 1.6e+04 1.2e+03  4 60 62 54 61 100 100 100 100 100  208923  1094405      0 0.00e+00    0 0.00e+00 100
>>>>
>>>> Notes about the flop counters here:
>>>> * The MatMult flops are not logged as GPU flops, but something is logged nonetheless.
>>>> * The GPU flop rate is 5x the total flop rate in KSPSolve :\
>>>> * I think these nodes have an FP64 peak flop rate of 200 Tflops, so we are at < 1%.
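On the "5x" note just above: a back-of-the-envelope check, assuming both rate columns are built from the same flop totals so the flops cancel and only the two times differ; this is illustrative arithmetic, not PETSc code.

#include <cstdio>

int main(void)
{
  /* numbers from the KSPSolve row above */
  const double event_time = 4.4173;    /* max KSPSolve time (s)  */
  const double rate_total = 208923.0;  /* overall Mflop/s column */
  const double rate_gpu   = 1094405.0; /* GPU Mflop/s column     */
  /* same flops divided by two different times, so the times scale inversely with the rates */
  const double gpu_time = event_time*rate_total/rate_gpu;

  std::printf("implied GPU-timed kernel time: %.2f s of the %.2f s KSPSolve\n", gpu_time, event_time);
  /* roughly 0.84 s of the 4.42 s solve is spent inside PetscLogGpuTimeBegin/End; the rest is
     MPI waits, kernel-launch latency, and other host-side work, which is why the GPU rate can
     legitimately sit well above the overall rate for a small problem like this one. */
  return 0;
}

That is consistent with the point at the top of the thread that the GPU rate should exceed the overall rate, dramatically so for small problems; the separate puzzle is the cases where the overall rate comes out higher than the GPU rate.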
>>>> This looks complicated, so just a single remark:
>>>>
>>>> My understanding of the benchmarking of vector ops led by Hannah was that you needed to be much bigger than 16M to hit peak. I need to get the tech report, but on 8 GPUs I would think you would be at 10% of peak or something right off the bat at these sizes. Barry, is that right?
>>>>
>>>>   Thanks,
>>>>
>>>>      Matt
>>>>
>>>> Anyway, not sure how to proceed, but I thought I would share. Maybe ask the Kokkos guys if they have looked at Crusher.
>>>>
>>>> Mark
>>>>
>>>> --
>>>> What most experimenters take for granted before they begin their experiments is infinitely more interesting than any results to which their experiments lead.
>>>> -- Norbert Wiener
>>>>
>>>> https://www.cse.buffalo.edu/~knepley/
>>>
>>> <jac_out_001_kokkos_Crusher_6_8_gpuawarempi.txt>
>>
>> <jac_out_001_kokkos_Crusher_5_8_notpl.txt> <jac_out_001_kokkos_Crusher_6_8_notpl.txt>