On Wed, Feb 22, 2023 at 11:15 AM Paul Grosse-Bley <[email protected]> wrote:
> Hi Barry,
>
> after using VecCUDAGetArray to initialize the RHS, that kernel still gets
> called as part of KSPSolve instead of KSPSetUp, but its runtime is way less
> significant than the cudaMemcpy before, so I guess I will leave it like
> this. Other than that I kept the code as in my first message in this thread
> (as you wrote, benchmark_ksp.c is not well suited for PCMG).
>
> The profiling results for PCMG and PCAMG look as I would expect them to
> look, i.e. one can nicely see the GPU load/kernel runtimes going down and
> up again for each V-cycle.
>
> I was wondering about -pc_mg_multiplicative_cycles, as it does not seem to
> make any difference. I would have expected to be able to increase the
> number of V-cycles per KSP iteration, but I keep seeing just a single
> V-cycle when changing the option (using PCMG).

How are you seeing this? You might try -log_trace to see if you get two
V-cycles.

> When using BoomerAMG from PCHYPRE, calling KSPSetComputeInitialGuess
> between bench iterations to reset the solution vector does not seem to
> work, as the residual keeps shrinking. Is this a bug? Any advice for
> working around this?

Looking at the doc
https://petsc.org/release/docs/manualpages/KSP/KSPSetComputeInitialGuess/
you use this with KSPSetComputeRHS. In src/snes/tests/ex13.c I just zero out
the solution vector.
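For reference, a minimal sketch of what zeroing the solution vector between
benchmark solves could look like (it reuses the ksp, BENCH_SOLVES, and
solve_bench_stage variables from the snippet quoted further down in this
thread; an illustration, not the actual code in ex13.c):

  Vec x;
  PetscCall(KSPGetSolution(ksp, &x));     /* vector that KSPSolve(ksp, NULL, NULL) solves into */
  for (PetscInt i = 0; i < BENCH_SOLVES; i++) {
    PetscCall(VecZeroEntries(x));         /* reset the initial guess to zero */
    PetscCall(PetscLogStagePush(solve_bench_stage));
    PetscCall(KSPSolve(ksp, NULL, NULL)); /* RHS provided through KSPSetComputeRHS */
    PetscCall(PetscLogStagePop());
  }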
> The profile for BoomerAMG also doesn't really show the V-cycle behavior of
> the other implementations. Most of the runtime seems to go into calls to
> cusparseDcsrsv, which might happen at the different levels, but the runtime
> of these kernels doesn't show the V-cycle pattern. According to the output
> with -pc_hypre_boomeramg_print_statistics it is doing the right thing
> though, so I guess it is alright (and if not, this is probably the wrong
> place to discuss it).
>
> When using PCAMGX, I see two PCApply calls (each showing a nice V-cycle
> behavior) in KSPSolve (three for the very first KSPSolve) while expecting
> just one. Each KSPSolve should do a single preconditioned Richardson
> iteration. Why is the preconditioner applied multiple times here?

Again, not sure what "see" is, but PCAMGX is pretty new and has not been
used much. Note that some KSP methods apply the PC before the iteration.

Mark

> Thank you,
> Paul Große-Bley
>
>
> On Monday, February 06, 2023 20:05 CET, Barry Smith <[email protected]> wrote:
>
> It should not crash; take a look at the test cases at the bottom of the
> file. You are likely correct: if the code, unfortunately, does use
> DMCreateMatrix(), it will not work out of the box for geometric multigrid,
> so it might be the wrong example for you.
>
> I don't know what you mean about clever. If you simply set the solution to
> zero at the beginning of the loop then it will just do the same solve
> multiple times. The setup should not do much of anything after the first
> solve. Though usually solves are big enough that one need not run solves
> multiple times to get a good understanding of their performance.
>
> On Feb 6, 2023, at 12:44 PM, Paul Grosse-Bley
> <[email protected]> wrote:
>
> Hi Barry,
>
> src/ksp/ksp/tutorials/bench_kspsolve.c is certainly the better starting
> point, thank you! Sadly I get a segfault when executing that example with
> PCMG and more than one level, i.e. with the minimal args:
>
> $ mpiexec -c 1 ./bench_kspsolve -split_ksp -pc_type mg -pc_mg_levels 2
> ===========================================
> Test: KSP performance - Poisson
> Input matrix: 27-pt finite difference stencil
> -n 100
> DoFs = 1000000
> Number of nonzeros = 26463592
>
> Step1 - creating Vecs and Mat...
> Step2a - running PCSetUp()...
> [0]PETSC ERROR: ------------------------------------------------------------------------
> [0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
> [0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [0]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and https://petsc.org/release/faq/
> [0]PETSC ERROR: or try https://docs.nvidia.com/cuda/cuda-memcheck/index.html on NVIDIA CUDA systems to find memory corruption errors
> [0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
> [0]PETSC ERROR: to get more information on the crash.
> [0]PETSC ERROR: Run with -malloc_debug to check if memory corruption is causing the crash.
> --------------------------------------------------------------------------
> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> with errorcode 59.
>
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --------------------------------------------------------------------------
>
> As the matrix is not created using DMDACreate3d, I expected it to fail due
> to the missing geometric information, but I expected it to fail more
> gracefully than with a segfault.
>
> I will try to combine bench_kspsolve.c with ex45.c to get easy MG
> preconditioning, especially since I am interested in the 7-pt stencil for
> now.
>
> Concerning my benchmarking loop from before: Is it generally discouraged
> to do this for KSPSolve because PETSc cleverly/lazily skips some of the
> work when doing the same solve multiple times, or are the solves not
> iterated in bench_kspsolve.c (while the MatMults are with -matmult) just to
> keep the runtime short?
>
> Thanks,
> Paul
>
> On Monday, February 06, 2023 17:04 CET, Barry Smith <[email protected]> wrote:
>
> Paul,
>
> I think src/ksp/ksp/tutorials/benchmark_ksp.c is the code intended to be
> used for simple benchmarking.
>
> You can use VecCUDAGetArray() to access the GPU memory of the vector and
> then call a CUDA kernel to compute the right-hand side vector directly on
> the GPU.
>
> Barry
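As an illustration of that suggestion, a minimal sketch of filling the
right-hand side directly in GPU memory; the fill_rhs kernel and its 1.0
values are placeholders, and the header names and error-checking macros
assume a reasonably recent CUDA-enabled PETSc build:

  #include <petscvec.h>
  #include <petscdevice_cuda.h> /* VecCUDAGetArray(), PetscCallCUDA() */

  __global__ void fill_rhs(PetscScalar *b, PetscInt n)
  {
    PetscInt i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) b[i] = 1.0; /* placeholder right-hand-side values */
  }

  static PetscErrorCode FillRHSOnGPU(Vec b) /* hypothetical helper */
  {
    PetscScalar *d_b;
    PetscInt     n;

    PetscFunctionBeginUser;
    PetscCall(VecGetLocalSize(b, &n));
    PetscCall(VecCUDAGetArray(b, &d_b)); /* device pointer; marks b as up to date on the GPU */
    fill_rhs<<<(n + 255) / 256, 256>>>(d_b, n);
    PetscCallCUDA(cudaGetLastError());
    PetscCall(VecCUDARestoreArray(b, &d_b));
    PetscFunctionReturn(0);
  }

A helper like this could be called from the KSPSetComputeRHS callback so the
right-hand side never needs a host-to-device copy.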
> On Feb 6, 2023, at 10:57 AM, Paul Grosse-Bley
> <[email protected]> wrote:
>
> Hi,
>
> I want to compare different implementations of multigrid solvers for
> Nvidia GPUs using the Poisson problem (starting from KSP tutorial example
> ex45.c). Therefore I am trying to get runtime results comparable to
> hpgmg-cuda <https://bitbucket.org/nsakharnykh/hpgmg-cuda/src/master/>
> (finite-volume), i.e. using multiple warmup and measurement solves and
> avoiding measuring setup time.
>
> For now I am using -log_view with added stages:
>
> PetscLogStageRegister("Solve Bench", &solve_bench_stage);
> for (int i = 0; i < BENCH_SOLVES; i++) {
>   PetscCall(KSPSetComputeInitialGuess(ksp, ComputeInitialGuess, NULL)); // reset x
>   PetscCall(KSPSetUp(ksp));                       // try to avoid setup overhead during solve
>   PetscCall(PetscDeviceContextSynchronize(dctx)); // make sure that everything is done
>
>   PetscLogStagePush(solve_bench_stage);
>   PetscCall(KSPSolve(ksp, NULL, NULL));
>   PetscLogStagePop();
> }
>
> This snippet is preceded by a similar loop for warmup.
>
> When profiling this using Nsight Systems, I see that the very first solve
> is much slower, which mostly corresponds to H2D (host to device) copies and
> e.g. cuBLAS setup (maybe also paging overheads as mentioned in the docs
> <https://petsc.org/release/docs/manual/profiling/#accurate-profiling-and-paging-overheads>,
> but probably insignificant in this case). The following solves have some
> overhead at the start from an H2D copy of a vector (the RHS I guess, as the
> copy is preceded by a matrix-vector product) in the first MatResidual call
> (call chain: KSPSolve->MatResidual->VecAYPX->VecCUDACopyTo->cudaMemcpyAsync).
> My interpretation of the profiling results (i.e. cuBLAS calls) is that that
> vector is overwritten with the residual in Daxpy and therefore has to be
> copied again for the next iteration.
>
> Is there an elegant way of avoiding that H2D copy? I have seen some
> examples on constructing matrices directly on the GPU, but nothing about
> vectors. Any further tips for benchmarking (vs. profiling) PETSc solvers?
> At the moment I am using jacobi as the smoother, but I would like to have a
> CUDA implementation of SOR instead. Is there a good way of achieving that,
> e.g. using PCHYPRE's BoomerAMG with a single level and a "SOR/Jacobi"
> smoother as the smoother in PCMG? Or is the overhead from constantly
> switching between PETSc and hypre too big?
>
> Thanks,
> Paul
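On that last question, Paul's own idea of a single-level BoomerAMG
"SOR/Jacobi" sweep as the PCMG smoother could be spelled out via the options
database; the sketch below assumes the usual PETSc option-prefix composition
for the level smoothers and is a starting point rather than a verified
configuration:

  /* Hypothetical: PCMG with one BoomerAMG "SOR/Jacobi" sweep per level as
     the smoother (option names assumed from prefix composition). */
  PetscCall(PetscOptionsInsertString(NULL,
    "-pc_type mg "
    "-mg_levels_ksp_type richardson -mg_levels_ksp_max_it 1 "
    "-mg_levels_pc_type hypre -mg_levels_pc_hypre_type boomeramg "
    "-mg_levels_pc_hypre_boomeramg_max_levels 1 "
    "-mg_levels_pc_hypre_boomeramg_relax_type_all SOR/Jacobi"));
  PetscCall(KSPSetFromOptions(ksp));

Whether the cost of switching between PETSc and hypre on every smoothing
step is acceptable is exactly the open question raised above.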
