You can run on AMD GPUs now with -dm_vec_type kokkos -dm_mat_type aijkokkos, for example. GAMG works that way, with the PtAP setup on device. If you use MatSetValuesCOO, then matrix assembly is also entirely on-device.
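A minimal sketch of the COO assembly path (a toy 2x2 matrix with made-up entries, not taken from any example in this thread):

#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat         A;
  /* three nonzeros of a toy 2x2 matrix, given as COO triples (i, j, v) */
  PetscInt    coo_i[] = {0, 0, 1};
  PetscInt    coo_j[] = {0, 1, 1};
  PetscScalar coo_v[] = {2.0, -1.0, 2.0};

  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
  PetscCall(MatCreate(PETSC_COMM_WORLD, &A));
  PetscCall(MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, 2, 2));
  PetscCall(MatSetFromOptions(A));                       /* picks up -mat_type aijkokkos / aijcusparse */
  PetscCall(MatSetPreallocationCOO(A, 3, coo_i, coo_j)); /* nonzero pattern, set once */
  PetscCall(MatSetValuesCOO(A, coo_v, INSERT_VALUES));   /* insert the values; the matrix is assembled on return */
  PetscCall(MatDestroy(&A));
  PetscCall(PetscFinalize());
  return 0;
}

With -mat_type aijkokkos or aijcusparse the value insertion itself runs on the device, as noted above; only the pattern setup touches the host.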
Justin Chang <jychan...@gmail.com> writes:

Hi Qi, Mark,

My colleague Suyash Tandon has almost completed a PETSc HIP port (essentially a hipification of the CUDA port) and has been trying to test it on the same OpenFOAM 3D lid-driven case. It would be interesting to see what the optimal HYPRE parameters are, as we could then experiment from the AMD side.

Thanks,
Justin

On Mon, Mar 28, 2022 at 10:28 AM Qi Yang <qiy...@oakland.edu> wrote:

Hi Mark,

Sure, I will try a 3D lid-driven case combining OpenFOAM, PETSc and HYPRE; let's see what happens.

Kind regards,
Qi

On Mon, Mar 28, 2022 at 11:04 PM Mark Adams <mfad...@lbl.gov> wrote:

Hi Qi, these are good discussions and data and we like to share, so let's keep this on the list.

* I would suggest you use a 3D test. This is more relevant to what HPC applications do.
* In my experience, hypre's default parameters are tuned for 2D low-order problems like this, so I would start with the defaults. I think they should be fine for 3D as well.
* As I think I said before, we have an AMGx interface under development, and I heard yesterday that it should not be long until it is available. It would be great if you could test that, and we can work with the NVIDIA developer to optimize it. We will let you know when it's available.

Cheers,
Mark

On Mon, Mar 28, 2022 at 10:44 AM Qi Yang <qiy...@oakland.edu> wrote:

Hi Mark and Barry,

I really appreciate your explanation of the setup process. Over the past few days I tried using the HYPRE AMG solver in place of PETSc's native AMG solver.

The HYPRE solver settings are as follows:

mpiexec -n 1 ./ex50 -da_grid_x 3000 -da_grid_y 3000 -ksp_type cg -pc_type hypre -pc_hypre_type boomeramg -pc_hypre_boomeramg_max_iter 1 -pc_hypre_boomeramg_strong_threshold 0.7 -pc_hypre_boomeramg_grid_sweeps_up 1 -pc_hypre_boomeramg_grid_sweeps_down 1 -pc_hypre_boomeramg_agg_nl 2 -pc_hypre_boomeramg_agg_num_paths 1 -pc_hypre_boomeramg_max_levels 25 -pc_hypre_boomeramg_coarsen_type PMIS -pc_hypre_boomeramg_interp_type ext+i -pc_hypre_boomeramg_P_max 2 -pc_hypre_boomeramg_truncfactor 0.2 -vec_type cuda -mat_type aijcusparse -ksp_monitor -ksp_view -log_view

[image: PMIS.PNG]

The interesting part is that I chose PMIS as the coarsening type; looking through the code, only PMIS has GPU code (host and device).
* HYPRE does reduce the solution time from 20 s to 8 s.
* A memory-mapping process appears inside the solve, which causes several gaps in the NVIDIA Nsight Systems profile below; I am not sure what it means.

[image: image.png]

I am really interested in doing some benchmarks with the hypre AMG solver. I have already connected OpenFOAM, PETSc, HYPRE and AMGX through the petsc4Foam API (https://develop.openfoam.com/modules/external-solver/-/tree/amgxwrapper/src/petsc4Foam). I prefer to use PETSc as the base matrix solver for a possible HIP implementation in the future; that way I can compare NVIDIA and AMD GPUs. It seems there are many benchmark cases I can do in the future.

Regards,
Qi
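For reference, the same CG + BoomerAMG configuration can also be set up from application code rather than via command-line options. A rough sketch, assuming the matrix and vectors already exist and a PETSc build configured with hypre (the function name SolveWithBoomerAMG is made up here):

#include <petscksp.h>

/* Rough sketch: CG + BoomerAMG from code, mirroring the command line above.
   Assumes A, b, x were assembled elsewhere (e.g. by the application). */
static PetscErrorCode SolveWithBoomerAMG(Mat A, Vec b, Vec x)
{
  KSP ksp;
  PC  pc;

  PetscFunctionBeginUser;
  PetscCall(KSPCreate(PETSC_COMM_WORLD, &ksp));
  PetscCall(KSPSetOperators(ksp, A, A));
  PetscCall(KSPSetType(ksp, KSPCG));
  PetscCall(KSPGetPC(ksp, &pc));
  PetscCall(PCSetType(pc, PCHYPRE));             /* requires a PETSc build configured with hypre */
  PetscCall(PCHYPRESetType(pc, "boomeramg"));
  /* the individual BoomerAMG knobs are easiest to keep in the options database */
  PetscCall(PetscOptionsSetValue(NULL, "-pc_hypre_boomeramg_coarsen_type", "PMIS"));
  PetscCall(PetscOptionsSetValue(NULL, "-pc_hypre_boomeramg_interp_type", "ext+i"));
  PetscCall(KSPSetFromOptions(ksp));             /* picks up any remaining -pc_hypre_boomeramg_* options */
  PetscCall(KSPSolve(ksp, b, x));
  PetscCall(KSPDestroy(&ksp));
  PetscFunctionReturn(0);
}

KSPSetFromOptions() is called last so that the other -pc_hypre_boomeramg_* options from the command line above still take effect.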
On Wed, Mar 23, 2022 at 9:39 AM Mark Adams <mfad...@lbl.gov> wrote:

A few points, but first this is a nice start. If you are interested in working on benchmarking, that would be great. If so, read on.

* Barry pointed out the SOR issues that are thrashing the memory system. This solve would run faster on the CPU (maybe; 9M equations is a lot).
* Most applications run for some time doing 100-1,000 or more solves with one configuration, and this amortizes the setup cost for each mesh (what I call the "mesh setup" cost).
* Many applications are nonlinear and use a full Newton solver that does a "matrix setup" for each solve, but many applications can also amortize this matrix setup (the PtAP stuff in the output, which is small for 2D problems but can be large for 3D problems).
* Now, hypre's mesh setup is definitely better than GAMG's, and AMGx is out of this world.
  - AMGx is the result of a serious development effort by NVIDIA about 15 years ago, with many tens of NVIDIA developer-years in it (I am guessing, but I know it was a serious effort for a few years).
  + We are currently working with the current AMGx developer, Matt, to provide an AMGx interface in PETSc, like the hypre one (DOE does not like us working with non-portable solvers, but AMGx is very good).
* Hypre and AMGx use "classic" AMG, which is like geometric multigrid (fast) for M-matrices (very low-order Laplacians, like ex50).
* GAMG uses "smoothed aggregation" AMG because this algorithm has better theoretical properties for high-order and elasticity problems, and the algorithm's implementations and default parameters have been optimized for these types of problems.

It would be interesting to add hypre to your study (ex50) and to add a high-order 3D elasticity problem (e.g., snes/tests/ex13, or Jed Brown has some nice elasticity problems). If you are interested, we can give you hypre parameters for elasticity problems. I have no experience with AMGx on elasticity, but the NVIDIA developer is available and can be looped in. For that matter, we could bring in the main hypre developer, Ruipeng, as well.

I would also suggest timing the setup (you can combine mesh and matrix setup if you like) and solve phases separately. ex13 does this, and we should find another 5-point-stencil example that does this if ex50 does not.

BTW, I have been intending to write a benchmarking paper this year with Matt and Ruipeng, but I am just not getting around to it... If you want to lead a paper and the experiments, we can help optimize and tune our solvers, set up tests, write background material, etc.

Cheers,
Mark
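One way to get that separate timing, assuming you control the driver code, is with log stages; a minimal sketch (the stage names are arbitrary, and ksp, b, x are assumed to exist already):

#include <petscksp.h>

/* Sketch: make -log_view report the setup and solve phases separately. */
static PetscErrorCode TimedSolve(KSP ksp, Vec b, Vec x)
{
  PetscLogStage stage_setup, stage_solve;

  PetscFunctionBeginUser;
  PetscCall(PetscLogStageRegister("Solver setup", &stage_setup));
  PetscCall(PetscLogStageRegister("Solver solve", &stage_solve));

  PetscCall(PetscLogStagePush(stage_setup));
  PetscCall(KSPSetUp(ksp));       /* forces PCSetUp (GAMG/hypre setup) to happen here */
  PetscCall(PetscLogStagePop());

  PetscCall(PetscLogStagePush(stage_solve));
  PetscCall(KSPSolve(ksp, b, x)); /* only the iterations are charged to this stage */
  PetscCall(PetscLogStagePop());
  PetscFunctionReturn(0);
}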
On Tue, Mar 22, 2022 at 12:30 PM Barry Smith <bsm...@petsc.dev> wrote:

Indeed, PCSetUp is taking most of the time (79%). In the version of PETSc you are running, it does a great deal of the setup work on the CPU. You can see there is a lot of data movement between the CPU and GPU (in both directions) during the setup: 64 1.91e+03 54 1.21e+03 90.

Clearly, we need help in porting all the parts of the GAMG setup that still occur on the CPU to the GPU.

Barry

On Mar 22, 2022, at 12:07 PM, Qi Yang <qiy...@oakland.edu> wrote:

Dear Barry,

Your advice was helpful: the total time is now reduced from 30 s to 20 s (all matrix operations now run on the GPU). I have also tried other settings for the AMG preconditioner, such as -pc_gamg_threshold 0.05 -pc_gamg_threshold_scale 0.5, but they did not seem to help much. The key point seems to be the PCSetUp process: from the log it takes the most time, and in the new Nsight Systems analysis there is a big gap before the KSP solver starts, which looks like the PCSetUp process. I am not sure, am I right?

<3.png>

PCSetUp 2 1.0 1.5594e+01 1.0 3.06e+09 1.0 0.0e+00 0.0e+00 0.0e+00 79 78 0 0 0 79 78 0 0 0 196 8433 64 1.91e+03 54 1.21e+03 90

Regards,
Qi

On Tue, Mar 22, 2022 at 10:44 PM Barry Smith <bsm...@petsc.dev> wrote:

It is using

MatSOR 369 1.0 9.1214e+00 1.0 7.32e+09 1.0 0.0e+00 0.0e+00 0.0e+00 29 27 0 0 0 29 27 0 0 0 803 0 0 0.00e+00 565 1.35e+03 0

which runs on the CPU, not the GPU, hence the large amount of time in memory copies and the poor performance. We are switching the default to Chebyshev/Jacobi, which runs completely on the GPU (it may already be switched in the main branch).

You can run with -mg_levels_pc_type jacobi. You should then see almost the entire solver running on the GPU.

You may need to tune the number of smoothing steps or other GAMG parameters to get the fastest solution time.

Barry

On Mar 22, 2022, at 10:30 AM, Qi Yang <qiy...@oakland.edu> wrote:

To whom it may concern,

I have tried PETSc ex50 (Poisson) with CUDA, the KSP CG solver and the GAMG preconditioner; however, it ran for about 30 s. I also tried NVIDIA AMGX with the same solver and the same grid (3000x3000), and it took only 2 s. I used Nsight Systems to analyze the two cases and found that PETSc spent much of its time in memory operations (63% of the total time, whereas AMGX spent only 19%). Attached are screenshots of both.

The PETSc command is: mpiexec -n 1 ./ex50 -da_grid_x 3000 -da_grid_y 3000 -ksp_type cg -pc_type gamg -pc_gamg_type agg -pc_gamg_agg_nsmooths 1 -vec_type cuda -mat_type aijcusparse -ksp_monitor -ksp_view -log_view

The log file is also attached.

Regards,
Qi

<1.png>
<2.png>
<log.PETSc_cg_amg_ex50_gpu_cuda>

<log.PETSc_cg_amg_jacobi_ex50_gpu_cuda>
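Combining Barry's smoother suggestion with Qi's original command line would give something like the following (not run here; the explicit -mg_levels_ksp_type chebyshev just spells out the Chebyshev/Jacobi default he mentions):

mpiexec -n 1 ./ex50 -da_grid_x 3000 -da_grid_y 3000 -ksp_type cg \
    -pc_type gamg -pc_gamg_type agg -pc_gamg_agg_nsmooths 1 \
    -mg_levels_ksp_type chebyshev -mg_levels_pc_type jacobi \
    -vec_type cuda -mat_type aijcusparse -ksp_monitor -ksp_view -log_view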