Preonly means exactly one application of the PC, so it will never converge by itself unless the PC is a full solver.
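To instead iterate V-cycles until a tolerance is reached (as explained in the next paragraph), a rough, untested sketch of the setup would look like the following; the number of levels, the rtol, and the names ksp, pc, b, x are placeholders, and it assumes the DM/interpolation setup is already in place as in ex45.c:

  KSP ksp;  /* assumed created with KSPCreate() and given operators (and a DM, as in ex45.c) */
  PC  pc;

  PetscCall(KSPSetType(ksp, KSPRICHARDSON));  /* picks up PCApplyRichardson_MG() automatically */
  PetscCall(KSPGetPC(ksp, &pc));
  PetscCall(PCSetType(pc, PCMG));
  PetscCall(PCMGSetLevels(pc, 3, NULL));      /* placeholder: 3 levels */
  PetscCall(KSPSetTolerances(ksp, 1e-8, PETSC_DEFAULT, PETSC_DEFAULT, PETSC_DEFAULT)); /* placeholder rtol */
  PetscCall(KSPSetFromOptions(ksp));          /* -ksp_rtol, -ksp_monitor, -pc_mg_* etc. still apply */
  PetscCall(KSPSolve(ksp, b, x));             /* each Richardson iteration is one V-cycle */

The equivalent command-line options would be -ksp_type richardson -pc_type mg -pc_mg_levels 3 -ksp_rtol 1e-8.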
Note there is a PCApplyRichardson_MG() that gets used automatically with KSPRICHARDSON. This does not have an "extra" application of the preconditioner, so 2 iterations of Richardson with MG will use 2 applications of the V-cycle. So it is exactly "multigrid as a solver, without a Krylov method", no extra work. So I don't think you need to make any "compromises".

  Barry

> On Feb 22, 2023, at 4:57 PM, Paul Grosse-Bley <[email protected]> wrote:
>
> Hi again,
>
> I now found out that
>
> 1. preonly ignores -ksp_pc_side right (makes sense, I guess).
> 2. richardson is incompatible with -ksp_pc_side right.
> 3. preonly gives less output for -log_view -pc_mg_log than richardson.
> 4. preonly also ignores -ksp_rtol etc.
> 5. preonly causes -log_view to measure incorrect timings for custom stages, i.e. the time for the stage (219us) is significantly shorter than the time for the KSPSolve inside the stage (~40ms).
>
> Number 4 will be problematic as I want to benchmark the number of V-cycles and the runtime for a given rtol. At the same time I want to avoid richardson now because of number 2 and the additional work of scaling the RHS.
>
> Is there any good way of just using MG V-cycles as a solver, i.e. without interference from an outer Krylov solver, and still iterating until convergence? Or will I just have to accept the additional V-cycle due to the left application of the PC with richardson?
>
> I guess I could also manually change -pc_mg_multiplicative_cycles until the residual gets low enough (using preonly), but that seems very inefficient.
>
> Best,
> Paul Große-Bley
>
> On Wednesday, February 22, 2023 21:26 CET, "Paul Grosse-Bley" <[email protected]> wrote:
>
>> I was using the Richardson KSP type, which I guess has the same behavior as GMRES here? I got rid of KSPSetComputeInitialGuess completely and will use preonly from now on, where maxits=1 does what I want it to do.
>>
>> Even BoomerAMG now shows the "V-cycle signature" I was looking for, so I think all my problems are resolved for now. Thank you very much, Barry and Mark!
>>
>> Best,
>> Paul Große-Bley
>>
>> On Wednesday, February 22, 2023 21:03 CET, Barry Smith <[email protected]> wrote:
>>
>>> On Feb 22, 2023, at 2:56 PM, Paul Grosse-Bley <[email protected]> wrote:
>>>
>>> Hi Barry,
>>>
>>> I think most of my "weird" observations came from the fact that I looked at iterations of KSPSolve where the residual had already converged. PCMG and PCGAMG do one V-cycle before even taking a look at the residual and then, independent of pc_mg_multiplicative_cycles, stop if it is converged.
>>>
>>> Looking at iterations that are not converged with PCMG, pc_mg_multiplicative_cycles works fine.
>>>
>>> At these iterations I also see the multiple calls to PCApply in a single KSPSolve iteration which were throwing me off with PCAMGX before.
>>>
>>> The reason for these multiple applications of the preconditioner (tested for both PCMG and PCAMGX) is that I had set maxits to 1 instead of 0. This could be better documented, I think.
>>
>> I do not understand what you are talking about with regard to maxits of 1 instead of 0. For KSP, maxits of 1 means one iteration; 0 is kind of meaningless.
>>
>> The reason that there is a PCApply at the start of the solve is that by default the KSPType is KSPGMRES, which by default uses left preconditioning, which means the right hand side needs to be scaled by the preconditioner before the KSP process starts.
>> So in this configuration one KSP iteration results in 2 PCApply calls. You can use -ksp_pc_side right to use right preconditioning, and then the number of PCApply calls will match the number of KSP iterations.
>>>
>>> Best,
>>> Paul Große-Bley
>>>
>>> On Wednesday, February 22, 2023 20:15 CET, Barry Smith <[email protected]> wrote:
>>>
>>>> On Feb 22, 2023, at 1:10 PM, Paul Grosse-Bley <[email protected]> wrote:
>>>>
>>>> Hi Mark,
>>>>
>>>> I use Nvidia Nsight Systems with --trace cuda,nvtx,osrt,cublas-verbose,cusparse-verbose together with the NVTX markers that come with -log_view. I.e. I get a nice view of all cuBLAS and cuSPARSE calls (in addition to the actual kernels, which are not always easy to attribute). For PCMG and PCGAMG I also use -pc_mg_log for even more detailed NVTX markers.
>>>>
>>>> The "signature" of a V-cycle in PCMG, PCGAMG and PCAMGX is pretty clear because kernel runtimes on coarser levels are much shorter. At the coarsest level, there normally isn't even enough work for the GPU (Nvidia A100) to be fully occupied, which is also visible in Nsight Systems.
>>>
>>> Hmm, I ran an example with -pc_mg_multiplicative_cycles 2 and it most definitely changes the run. I am not understanding why it would not work for you. If you run with and without the option, are the exact same counts listed for all the events in the -log_view output?
>>>>
>>>> I run only a single MPI rank with a single GPU, so profiling is straightforward.
>>>>
>>>> Best,
>>>> Paul Große-Bley
>>>>
>>>> On Wednesday, February 22, 2023 18:24 CET, Mark Adams <[email protected]> wrote:
>>>>>
>>>>> On Wed, Feb 22, 2023 at 11:15 AM Paul Grosse-Bley <[email protected]> wrote:
>>>>>> Hi Barry,
>>>>>>
>>>>>> after using VecCUDAGetArray to initialize the RHS, that kernel still gets called as part of KSPSolve instead of KSPSetUp, but its runtime is way less significant than the cudaMemcpy before, so I guess I will leave it like this. Other than that I kept the code like in my first message in this thread (as you wrote, benchmark_ksp.c is not well suited for PCMG).
>>>>>>
>>>>>> The profiling results for PCMG and PCAMG look as I would expect them to look, i.e. one can nicely see the GPU load/kernel runtimes going down and up again for each V-cycle.
>>>>>>
>>>>>> I was wondering about -pc_mg_multiplicative_cycles as it does not seem to make any difference. I would have expected to be able to increase the number of V-cycles per KSP iteration, but I keep seeing just a single V-cycle when changing the option (using PCMG).
>>>>>
>>>>> How are you seeing this? You might try -log_trace to see if you get two V-cycles.
>>>>>
>>>>>> When using BoomerAMG from PCHYPRE, calling KSPSetComputeInitialGuess between bench iterations to reset the solution vector does not seem to work, as the residual keeps shrinking. Is this a bug? Any advice for working around this?
>>>>>
>>>>> Looking at the doc https://petsc.org/release/docs/manualpages/KSP/KSPSetComputeInitialGuess/ you use this with KSPSetComputeRHS.
>>>>>
>>>>> In src/snes/tests/ex13.c I just zero out the solution vector.
>>>>>
>>>>>> The profile for BoomerAMG also doesn't really show the V-cycle behavior of the other implementations.
>>>>>> Most of the runtime seems to go into calls to cusparseDcsrsv, which might happen at the different levels, but the runtime of these kernels doesn't show the V-cycle pattern. According to the output with -pc_hypre_boomeramg_print_statistics it is doing the right thing though, so I guess it is alright (and if not, this is probably the wrong place to discuss it).
>>>>>>
>>>>>> When using PCAMGX, I see two PCApply calls (each showing a nice V-cycle behavior) in KSPSolve (three for the very first KSPSolve) while expecting just one. Each KSPSolve should do a single preconditioned Richardson iteration. Why is the preconditioner applied multiple times here?
>>>>>
>>>>> Again, not sure what "see" is, but PCAMGX is pretty new and has not been used much. Note some KSP methods apply the PC before the iteration.
>>>>>
>>>>> Mark
>>>>>
>>>>>> Thank you,
>>>>>> Paul Große-Bley
>>>>>>
>>>>>> On Monday, February 06, 2023 20:05 CET, Barry Smith <[email protected]> wrote:
>>>>>>
>>>>>> It should not crash; take a look at the test cases at the bottom of the file. You are likely correct if the code, unfortunately, does use DMCreateMatrix() it will not work out of the box for geometric multigrid. So it might be the wrong example for you.
>>>>>>
>>>>>> I don't know what you mean about clever. If you simply set the solution to zero at the beginning of the loop then it will just do the same solve multiple times. The setup should not do much of anything after the first solve. Though usually solves are big enough that one need not run solves multiple times to get a good understanding of their performance.
>>>>>>
>>>>>>> On Feb 6, 2023, at 12:44 PM, Paul Grosse-Bley <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi Barry,
>>>>>>>
>>>>>>> src/ksp/ksp/tutorials/bench_kspsolve.c is certainly the better starting point, thank you! Sadly I get a segfault when executing that example with PCMG and more than one level, i.e. with the minimal args:
>>>>>>>
>>>>>>> $ mpiexec -c 1 ./bench_kspsolve -split_ksp -pc_type mg -pc_mg_levels 2
>>>>>>> ===========================================
>>>>>>> Test: KSP performance - Poisson
>>>>>>> Input matrix: 27-pt finite difference stencil
>>>>>>> -n 100
>>>>>>> DoFs = 1000000
>>>>>>> Number of nonzeros = 26463592
>>>>>>>
>>>>>>> Step1 - creating Vecs and Mat...
>>>>>>> Step2a - running PCSetUp()...
>>>>>>> [0]PETSC ERROR: ------------------------------------------------------------------------
>>>>>>> [0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
>>>>>>> [0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>>>>>>> [0]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and https://petsc.org/release/faq/
>>>>>>> [0]PETSC ERROR: or try https://docs.nvidia.com/cuda/cuda-memcheck/index.html on NVIDIA CUDA systems to find memory corruption errors
>>>>>>> [0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
>>>>>>> [0]PETSC ERROR: to get more information on the crash.
>>>>>>> [0]PETSC ERROR: Run with -malloc_debug to check if memory corruption is causing the crash.
>>>>>>> --------------------------------------------------------------------------
>>>>>>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>>>>>>> with errorcode 59.
>>>>>>>
>>>>>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>>>>>> You may or may not see output from other processes, depending on
>>>>>>> exactly when Open MPI kills them.
>>>>>>> --------------------------------------------------------------------------
>>>>>>>
>>>>>>> As the matrix is not created using DMDACreate3d I expected it to fail due to the missing geometric information, but I expected it to fail more gracefully than with a segfault. I will try to combine bench_kspsolve.c with ex45.c to get easy MG preconditioning, especially since I am interested in the 7pt stencil for now.
>>>>>>>
>>>>>>> Concerning my benchmarking loop from before: Is it generally discouraged to do this for KSPSolve due to PETSc cleverly/lazily skipping some of the work when doing the same solve multiple times, or are the solves not iterated in bench_kspsolve.c (while the MatMults are with -matmult) just to keep the runtime short?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Paul
>>>>>>>
>>>>>>> On Monday, February 06, 2023 17:04 CET, Barry Smith <[email protected]> wrote:
>>>>>>>
>>>>>>> Paul,
>>>>>>>
>>>>>>> I think src/ksp/ksp/tutorials/benchmark_ksp.c is the code intended to be used for simple benchmarking.
>>>>>>>
>>>>>>> You can use VecCUDAGetArray() to access the GPU memory of the vector and then call a CUDA kernel to compute the right hand side vector directly on the GPU.
>>>>>>>
>>>>>>> Barry
>>>>>>>
>>>>>>>> On Feb 6, 2023, at 10:57 AM, Paul Grosse-Bley <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I want to compare different implementations of multigrid solvers for Nvidia GPUs using the Poisson problem (starting from KSP tutorial example ex45.c). Therefore I am trying to get runtime results comparable to hpgmg-cuda <https://bitbucket.org/nsakharnykh/hpgmg-cuda/src/master/> (finite-volume), i.e. using multiple warmup and measurement solves and avoiding measuring setup time. For now I am using -log_view with added stages:
>>>>>>>>
>>>>>>>> PetscLogStageRegister("Solve Bench", &solve_bench_stage);
>>>>>>>> for (int i = 0; i < BENCH_SOLVES; i++) {
>>>>>>>>   PetscCall(KSPSetComputeInitialGuess(ksp, ComputeInitialGuess, NULL)); // reset x
>>>>>>>>   PetscCall(KSPSetUp(ksp)); // try to avoid setup overhead during solve
>>>>>>>>   PetscCall(PetscDeviceContextSynchronize(dctx)); // make sure that everything is done
>>>>>>>>
>>>>>>>>   PetscLogStagePush(solve_bench_stage);
>>>>>>>>   PetscCall(KSPSolve(ksp, NULL, NULL));
>>>>>>>>   PetscLogStagePop();
>>>>>>>> }
>>>>>>>>
>>>>>>>> This snippet is preceded by a similar loop for warmup.
>>>>>>>>
>>>>>>>> When profiling this using Nsight Systems, I see that the very first solve is much slower, which mostly corresponds to H2D (host to device) copies and e.g. cuBLAS setup (maybe also paging overheads as mentioned in the docs <https://petsc.org/release/docs/manual/profiling/#accurate-profiling-and-paging-overheads>, but probably insignificant in this case).
>>>>>>>> The following solves have some overhead at the start from an H2D copy of a vector (the RHS, I guess, as the copy is preceded by a matrix-vector product) in the first MatResidual call (call chain: KSPSolve->MatResidual->VecAYPX->VecCUDACopyTo->cudaMemcpyAsync). My interpretation of the profiling results (i.e. the cuBLAS calls) is that that vector is overwritten with the residual in Daxpy and therefore has to be copied again for the next iteration.
>>>>>>>>
>>>>>>>> Is there an elegant way of avoiding that H2D copy? I have seen some examples on constructing matrices directly on the GPU, but nothing about vectors. Any further tips for benchmarking (vs. profiling) PETSc solvers? At the moment I am using jacobi as the smoother, but I would like to have a CUDA implementation of SOR instead. Is there a good way of achieving that, e.g. using PCHYPRE's BoomerAMG with a single level and the "SOR/Jacobi" smoother as the smoother in PCMG? Or is the overhead from constantly switching between PETSc and hypre too big?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Paul
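For reference, a minimal, untested sketch of the VecCUDAGetArray() approach suggested above for computing the right hand side directly on the GPU (and so avoiding the H2D copy); the kernel body, the launch configuration, the name ComputeRHSOnGPU, and the header choice are assumptions that may need adjusting to the PETSc version in use, and the file must be compiled as CUDA (.cu):

  #include <petscvec.h>
  #include <petscdevice_cuda.h>  /* VecCUDAGetArray()/VecCUDARestoreArray(); older releases declare them elsewhere */

  __global__ void FillRHS(PetscScalar *b, PetscInt n)
  {
    PetscInt i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) b[i] = 1.0; /* placeholder forcing term */
  }

  PetscErrorCode ComputeRHSOnGPU(Vec b)
  {
    PetscScalar *barray;
    PetscInt     n;

    PetscFunctionBeginUser;
    PetscCall(VecGetLocalSize(b, &n));
    PetscCall(VecCUDAGetArray(b, &barray));       /* raw device pointer; the GPU copy becomes the valid one */
    FillRHS<<<(n + 255) / 256, 256>>>(barray, n); /* placeholder block size of 256 threads */
    PetscCallCUDA(cudaGetLastError());
    PetscCall(VecCUDARestoreArray(b, &barray));
    PetscFunctionReturn(PETSC_SUCCESS);
  }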

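Similarly, a small, untested sketch of the benchmarking loop quoted above with the solution reset by zeroing the vector (the approach Mark mentions for src/snes/tests/ex13.c) instead of via KSPSetComputeInitialGuess(); BENCH_SOLVES, ksp, b, and x are assumed to exist as in the original loop:

  PetscDeviceContext dctx;
  PetscLogStage      solve_bench_stage;

  PetscCall(PetscDeviceContextGetCurrentContext(&dctx));
  PetscCall(PetscLogStageRegister("Solve Bench", &solve_bench_stage));
  for (int i = 0; i < BENCH_SOLVES; i++) {
    PetscCall(VecZeroEntries(x));                   /* reset the solution so every solve does the same work */
    PetscCall(KSPSetUp(ksp));                       /* keep setup out of the timed stage (no-op after the first call) */
    PetscCall(PetscDeviceContextSynchronize(dctx)); /* make sure all previously launched GPU work has finished */

    PetscCall(PetscLogStagePush(solve_bench_stage));
    PetscCall(KSPSolve(ksp, b, x));
    PetscCall(PetscLogStagePop());
  }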