Preonly means exactly one application of the PC, so it will never converge by itself unless the PC is a full solver.
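To instead iterate V-cycles until a tolerance is reached (as explained in the next paragraph), a rough, untested sketch of the setup would look like the following; the number of levels, the rtol, and the names ksp, pc, b, x are placeholders, and it assumes the DM/interpolation setup is already in place as in ex45.c:

  KSP ksp;  /* assumed created with KSPCreate() and given operators (and a DM, as in ex45.c) */
  PC  pc;

  PetscCall(KSPSetType(ksp, KSPRICHARDSON));  /* picks up PCApplyRichardson_MG() automatically */
  PetscCall(KSPGetPC(ksp, &pc));
  PetscCall(PCSetType(pc, PCMG));
  PetscCall(PCMGSetLevels(pc, 3, NULL));      /* placeholder: 3 levels */
  PetscCall(KSPSetTolerances(ksp, 1e-8, PETSC_DEFAULT, PETSC_DEFAULT, PETSC_DEFAULT)); /* placeholder rtol */
  PetscCall(KSPSetFromOptions(ksp));          /* -ksp_rtol, -ksp_monitor, -pc_mg_* etc. still apply */
  PetscCall(KSPSolve(ksp, b, x));             /* each Richardson iteration is one V-cycle */

The equivalent command-line options would be -ksp_type richardson -pc_type mg -pc_mg_levels 3 -ksp_rtol 1e-8.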
Note there is a PCApplyRichardson_MG() that gets used automatically with KSPRICHARDSON. This does not have an "extra" application of the preconditioner, so 2 iterations of Richardson with MG will use 2 applications of the V-cycle. So it is exactly "multigrid as a solver, without a Krylov method", no extra work. So I don't think you need to make any "compromises".

  Barry

> On Feb 22, 2023, at 4:57 PM, Paul Grosse-Bley <[email protected]> wrote:
>
> Hi again,
>
> I now found out that
>
> 1. preonly ignores -ksp_pc_side right (makes sense, I guess).
> 2. richardson is incompatible with -ksp_pc_side right.
> 3. preonly gives less output for -log_view -pc_mg_log than richardson.
> 4. preonly also ignores -ksp_rtol etc.
> 5. preonly causes -log_view to measure incorrect timings for custom stages, i.e. the time for the stage (219us) is significantly shorter than the time for the KSPSolve inside the stage (~40ms).
>
> Number 4 will be problematic as I want to benchmark the number of V-cycles and the runtime for a given rtol. At the same time I want to avoid richardson now because of number 2 and the additional work of scaling the RHS.
>
> Is there any good way of just using MG V-cycles as a solver, i.e. without interference from an outer Krylov solver, and still iterating until convergence? Or will I just have to accept the additional V-cycle due to the left application of the PC with richardson?
>
> I guess I could also manually change -pc_mg_multiplicative_cycles until the residual gets low enough (using preonly), but that seems very inefficient.
>
> Best,
> Paul Große-Bley
>
> On Wednesday, February 22, 2023 21:26 CET, "Paul Grosse-Bley" <[email protected]> wrote:
>
>> I was using the Richardson KSP type, which I guess has the same behavior as GMRES here? I got rid of KSPSetComputeInitialGuess completely and will use preonly from now on, where maxits=1 does what I want it to do.
>>
>> Even BoomerAMG now shows the "V-cycle signature" I was looking for, so I think all my problems are resolved for now. Thank you very much, Barry and Mark!
>>
>> Best,
>> Paul Große-Bley
>>
>> On Wednesday, February 22, 2023 21:03 CET, Barry Smith <[email protected]> wrote:
>>
>>> On Feb 22, 2023, at 2:56 PM, Paul Grosse-Bley <[email protected]> wrote:
>>>
>>> Hi Barry,
>>>
>>> I think most of my "weird" observations came from the fact that I looked at iterations of KSPSolve where the residual had already converged. PCMG and PCGAMG do one V-cycle before even taking a look at the residual and then, independent of pc_mg_multiplicative_cycles, stop if it is converged.
>>>
>>> Looking at iterations that are not converged with PCMG, pc_mg_multiplicative_cycles works fine.
>>>
>>> At these iterations I also see the multiple calls to PCApply in a single KSPSolve iteration which were throwing me off with PCAMGX before.
>>>
>>> The reason for these multiple applications of the preconditioner (tested for both PCMG and PCAMGX) is that I had set maxits to 1 instead of 0. This could be better documented, I think.
>>
>> I do not understand what you are talking about with regard to maxits of 1 instead of 0. For KSP, maxits of 1 means one iteration; 0 is kind of meaningless.
>>
>> The reason that there is a PCApply at the start of the solve is that by default the KSPType is KSPGMRES, which by default uses left preconditioning, which means the right hand side needs to be scaled by the preconditioner before the KSP process starts.
>> So in this configuration one KSP iteration results in 2 PCApply calls. You can use -ksp_pc_side right to use right preconditioning, and then the number of PCApply calls will match the number of KSP iterations.
>>>
>>> Best,
>>> Paul Große-Bley
>>>
>>> On Wednesday, February 22, 2023 20:15 CET, Barry Smith <[email protected]> wrote:
>>>
>>>> On Feb 22, 2023, at 1:10 PM, Paul Grosse-Bley <[email protected]> wrote:
>>>>
>>>> Hi Mark,
>>>>
>>>> I use Nvidia Nsight Systems with --trace cuda,nvtx,osrt,cublas-verbose,cusparse-verbose together with the NVTX markers that come with -log_view. I.e. I get a nice view of all cuBLAS and cuSPARSE calls (in addition to the actual kernels, which are not always easy to attribute). For PCMG and PCGAMG I also use -pc_mg_log for even more detailed NVTX markers.
>>>>
>>>> The "signature" of a V-cycle in PCMG, PCGAMG and PCAMGX is pretty clear because kernel runtimes on coarser levels are much shorter. At the coarsest level, there normally isn't even enough work for the GPU (Nvidia A100) to be fully occupied, which is also visible in Nsight Systems.
>>>
>>> Hmm, I ran an example with -pc_mg_multiplicative_cycles 2 and it most definitely changes the run. I am not understanding why it would not work for you. If you run with and without the option, are the exact same counts listed for all the events in the -log_view output?
>>>>
>>>> I run only a single MPI rank with a single GPU, so profiling is straightforward.
>>>>
>>>> Best,
>>>> Paul Große-Bley
>>>>
>>>> On Wednesday, February 22, 2023 18:24 CET, Mark Adams <[email protected]> wrote:
>>>>>
>>>>> On Wed, Feb 22, 2023 at 11:15 AM Paul Grosse-Bley <[email protected]> wrote:
>>>>>> Hi Barry,
>>>>>>
>>>>>> after using VecCUDAGetArray to initialize the RHS, that kernel still gets called as part of KSPSolve instead of KSPSetUp, but its runtime is way less significant than the cudaMemcpy before, so I guess I will leave it like this. Other than that I kept the code like in my first message in this thread (as you wrote, benchmark_ksp.c is not well suited for PCMG).
>>>>>>
>>>>>> The profiling results for PCMG and PCAMG look as I would expect them to look, i.e. one can nicely see the GPU load/kernel runtimes going down and up again for each V-cycle.
>>>>>>
>>>>>> I was wondering about -pc_mg_multiplicative_cycles as it does not seem to make any difference. I would have expected to be able to increase the number of V-cycles per KSP iteration, but I keep seeing just a single V-cycle when changing the option (using PCMG).
>>>>>
>>>>> How are you seeing this? You might try -log_trace to see if you get two V-cycles.
>>>>>
>>>>>> When using BoomerAMG from PCHYPRE, calling KSPSetComputeInitialGuess between bench iterations to reset the solution vector does not seem to work, as the residual keeps shrinking. Is this a bug? Any advice for working around this?
>>>>>
>>>>> Looking at the doc https://petsc.org/release/docs/manualpages/KSP/KSPSetComputeInitialGuess/ you use this with KSPSetComputeRHS.
>>>>>
>>>>> In src/snes/tests/ex13.c I just zero out the solution vector.
>>>>>
>>>>>> The profile for BoomerAMG also doesn't really show the V-cycle behavior of the other implementations.
>>>>>> Most of the runtime seems to go into calls to cusparseDcsrsv, which might happen at the different levels, but the runtime of these kernels doesn't show the V-cycle pattern. According to the output with -pc_hypre_boomeramg_print_statistics it is doing the right thing though, so I guess it is alright (and if not, this is probably the wrong place to discuss it).
>>>>>>
>>>>>> When using PCAMGX, I see two PCApply calls (each showing a nice V-cycle behavior) in KSPSolve (three for the very first KSPSolve) while expecting just one. Each KSPSolve should do a single preconditioned Richardson iteration. Why is the preconditioner applied multiple times here?
>>>>>
>>>>> Again, not sure what "see" is, but PCAMGX is pretty new and has not been used much. Note some KSP methods apply the PC before the iteration.
>>>>>
>>>>> Mark
>>>>>
>>>>>> Thank you,
>>>>>> Paul Große-Bley
>>>>>>
>>>>>> On Monday, February 06, 2023 20:05 CET, Barry Smith <[email protected]> wrote:
>>>>>>
>>>>>> It should not crash; take a look at the test cases at the bottom of the file. You are likely correct if the code, unfortunately, does use DMCreateMatrix() it will not work out of the box for geometric multigrid. So it might be the wrong example for you.
>>>>>>
>>>>>> I don't know what you mean about clever. If you simply set the solution to zero at the beginning of the loop then it will just do the same solve multiple times. The setup should not do much of anything after the first solve. Though usually solves are big enough that one need not run solves multiple times to get a good understanding of their performance.
>>>>>>
>>>>>>> On Feb 6, 2023, at 12:44 PM, Paul Grosse-Bley <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi Barry,
>>>>>>>
>>>>>>> src/ksp/ksp/tutorials/bench_kspsolve.c is certainly the better starting point, thank you! Sadly I get a segfault when executing that example with PCMG and more than one level, i.e. with the minimal args:
>>>>>>>
>>>>>>> $ mpiexec -c 1 ./bench_kspsolve -split_ksp -pc_type mg -pc_mg_levels 2
>>>>>>> ===========================================
>>>>>>> Test: KSP performance - Poisson
>>>>>>> Input matrix: 27-pt finite difference stencil
>>>>>>> -n 100
>>>>>>> DoFs = 1000000
>>>>>>> Number of nonzeros = 26463592
>>>>>>>
>>>>>>> Step1 - creating Vecs and Mat...
>>>>>>> Step2a - running PCSetUp()...
>>>>>>> [0]PETSC ERROR: ------------------------------------------------------------------------
>>>>>>> [0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
>>>>>>> [0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
>>>>>>> [0]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind and https://petsc.org/release/faq/
>>>>>>> [0]PETSC ERROR: or try https://docs.nvidia.com/cuda/cuda-memcheck/index.html on NVIDIA CUDA systems to find memory corruption errors
>>>>>>> [0]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
>>>>>>> [0]PETSC ERROR: to get more information on the crash.
>>>>>>> [0]PETSC ERROR: Run with -malloc_debug to check if memory corruption is causing the crash.
>>>>>>> --------------------------------------------------------------------------
>>>>>>> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>>>>>>> with errorcode 59.
>>>>>>>
>>>>>>> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>>>>>>> You may or may not see output from other processes, depending on
>>>>>>> exactly when Open MPI kills them.
>>>>>>> --------------------------------------------------------------------------
>>>>>>>
>>>>>>> As the matrix is not created using DMDACreate3d I expected it to fail due to the missing geometric information, but I expected it to fail more gracefully than with a segfault. I will try to combine bench_kspsolve.c with ex45.c to get easy MG preconditioning, especially since I am interested in the 7pt stencil for now.
>>>>>>>
>>>>>>> Concerning my benchmarking loop from before: Is it generally discouraged to do this for KSPSolve due to PETSc cleverly/lazily skipping some of the work when doing the same solve multiple times, or are the solves not iterated in bench_kspsolve.c (while the MatMults are with -matmult) just to keep the runtime short?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Paul
>>>>>>>
>>>>>>> On Monday, February 06, 2023 17:04 CET, Barry Smith <[email protected]> wrote:
>>>>>>>
>>>>>>> Paul,
>>>>>>>
>>>>>>> I think src/ksp/ksp/tutorials/benchmark_ksp.c is the code intended to be used for simple benchmarking.
>>>>>>>
>>>>>>> You can use VecCUDAGetArray() to access the GPU memory of the vector and then call a CUDA kernel to compute the right hand side vector directly on the GPU.
>>>>>>>
>>>>>>> Barry
>>>>>>>
>>>>>>>> On Feb 6, 2023, at 10:57 AM, Paul Grosse-Bley <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I want to compare different implementations of multigrid solvers for Nvidia GPUs using the Poisson problem (starting from KSP tutorial example ex45.c). Therefore I am trying to get runtime results comparable to hpgmg-cuda <https://bitbucket.org/nsakharnykh/hpgmg-cuda/src/master/> (finite-volume), i.e. using multiple warmup and measurement solves and avoiding measuring setup time. For now I am using -log_view with added stages:
>>>>>>>>
>>>>>>>> PetscLogStageRegister("Solve Bench", &solve_bench_stage);
>>>>>>>> for (int i = 0; i < BENCH_SOLVES; i++) {
>>>>>>>>   PetscCall(KSPSetComputeInitialGuess(ksp, ComputeInitialGuess, NULL)); // reset x
>>>>>>>>   PetscCall(KSPSetUp(ksp)); // try to avoid setup overhead during solve
>>>>>>>>   PetscCall(PetscDeviceContextSynchronize(dctx)); // make sure that everything is done
>>>>>>>>
>>>>>>>>   PetscLogStagePush(solve_bench_stage);
>>>>>>>>   PetscCall(KSPSolve(ksp, NULL, NULL));
>>>>>>>>   PetscLogStagePop();
>>>>>>>> }
>>>>>>>>
>>>>>>>> This snippet is preceded by a similar loop for warmup.
>>>>>>>>
>>>>>>>> When profiling this using Nsight Systems, I see that the very first solve is much slower, which mostly corresponds to H2D (host to device) copies and e.g. cuBLAS setup (maybe also paging overheads as mentioned in the docs <https://petsc.org/release/docs/manual/profiling/#accurate-profiling-and-paging-overheads>, but probably insignificant in this case).
>>>>>>>> The following solves have some overhead at the start from an H2D copy of a vector (the RHS, I guess, as the copy is preceded by a matrix-vector product) in the first MatResidual call (call chain: KSPSolve->MatResidual->VecAYPX->VecCUDACopyTo->cudaMemcpyAsync). My interpretation of the profiling results (i.e. the cuBLAS calls) is that that vector is overwritten with the residual in Daxpy and therefore has to be copied again for the next iteration.
>>>>>>>>
>>>>>>>> Is there an elegant way of avoiding that H2D copy? I have seen some examples on constructing matrices directly on the GPU, but nothing about vectors. Any further tips for benchmarking (vs. profiling) PETSc solvers? At the moment I am using jacobi as the smoother, but I would like to have a CUDA implementation of SOR instead. Is there a good way of achieving that, e.g. using PCHYPRE's BoomerAMG with a single level and the "SOR/Jacobi" smoother as the smoother in PCMG? Or is the overhead from constantly switching between PETSc and hypre too big?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Paul
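For reference, a minimal, untested sketch of the VecCUDAGetArray() approach suggested above for computing the right hand side directly on the GPU (and so avoiding the H2D copy); the kernel body, the launch configuration, the name ComputeRHSOnGPU, and the header choice are assumptions that may need adjusting to the PETSc version in use, and the file must be compiled as CUDA (.cu):

  #include <petscvec.h>
  #include <petscdevice_cuda.h>  /* VecCUDAGetArray()/VecCUDARestoreArray(); older releases declare them elsewhere */

  __global__ void FillRHS(PetscScalar *b, PetscInt n)
  {
    PetscInt i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) b[i] = 1.0; /* placeholder forcing term */
  }

  PetscErrorCode ComputeRHSOnGPU(Vec b)
  {
    PetscScalar *barray;
    PetscInt     n;

    PetscFunctionBeginUser;
    PetscCall(VecGetLocalSize(b, &n));
    PetscCall(VecCUDAGetArray(b, &barray));       /* raw device pointer; the GPU copy becomes the valid one */
    FillRHS<<<(n + 255) / 256, 256>>>(barray, n); /* placeholder block size of 256 threads */
    PetscCallCUDA(cudaGetLastError());
    PetscCall(VecCUDARestoreArray(b, &barray));
    PetscFunctionReturn(PETSC_SUCCESS);
  }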

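Similarly, a small, untested sketch of the benchmarking loop quoted above with the solution reset by zeroing the vector (the approach Mark mentions for src/snes/tests/ex13.c) instead of via KSPSetComputeInitialGuess(); BENCH_SOLVES, ksp, b, and x are assumed to exist as in the original loop:

  PetscDeviceContext dctx;
  PetscLogStage      solve_bench_stage;

  PetscCall(PetscDeviceContextGetCurrentContext(&dctx));
  PetscCall(PetscLogStageRegister("Solve Bench", &solve_bench_stage));
  for (int i = 0; i < BENCH_SOLVES; i++) {
    PetscCall(VecZeroEntries(x));                   /* reset the solution so every solve does the same work */
    PetscCall(KSPSetUp(ksp));                       /* keep setup out of the timed stage (no-op after the first call) */
    PetscCall(PetscDeviceContextSynchronize(dctx)); /* make sure all previously launched GPU work has finished */

    PetscCall(PetscLogStagePush(solve_bench_stage));
    PetscCall(KSPSolve(ksp, b, x));
    PetscCall(PetscLogStagePop());
  }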