> BTW, on unstructured grids, coloring requires a lot of colors and thus
> many times more bandwidth (due to multiple passes) than the operator itself.
I've noticed -- in AMGx the multicolor GS was generally dramatically slower
than jacobi because of lots of colors with few elements.

> You can use sparse triangular kernels like ILU (provided by cuBLAS), but
> they are so mindbogglingly slow that you'll go back to the drawing board
> and try to use a multigrid method of some sort with polynomial/point-block
> smoothing.

I definitely need multigrid. I was under the impression that GAMG was
relatively CUDA-complete -- is that not the case? What functionality works
fully on the GPU and what doesn't, without any host transfers (aside from
what's needed for MPI)?

If I use -pc_type gamg -mg_levels_pc_type pbjacobi -mg_levels_ksp_type
richardson, is that fully on device, while -mg_levels_pc_type ilu or
-mg_levels_pc_type sor require transfers?

On Tue, Jan 10, 2023 at 2:47 PM Jed Brown <[email protected]> wrote:

> The joy of GPUs. You can use sparse triangular kernels like ILU (provided
> by cuBLAS), but they are so mindbogglingly slow that you'll go back to the
> drawing board and try to use a multigrid method of some sort with
> polynomial/point-block smoothing.
>
> BTW, on unstructured grids, coloring requires a lot of colors and thus
> many times more bandwidth (due to multiple passes) than the operator itself.
>
> Mark Lohry <[email protected]> writes:
>
> > Well that's suboptimal. What are my options for 100% GPU solves with no
> > host transfers?
> >
> > On Tue, Jan 10, 2023, 2:23 PM Barry Smith <[email protected]> wrote:
> >
> >> On Jan 10, 2023, at 2:19 PM, Mark Lohry <[email protected]> wrote:
> >>
> >>> Is DILU a point-block method? We have -pc_type pbjacobi (and vpbjacobi
> >>> if the node size is not uniform). They are good choices for
> >>> scale-resolving CFD on GPUs.
> >>
> >> I was hoping you'd know :) pbjacobi is underperforming ilu by a pretty
> >> wide margin on some of the systems i'm looking at.
> >>
> >>> We don't have colored smoothers currently in PETSc.
> >>
> >> So what happens under the hood when I run -mg_levels_pc_type sor on GPU?
> >> Are you actually decomposing the matrix into a lower triangular part and
> >> computing updates with matrix multiplications? Or is it just the
> >> standard serial algorithm with thread safety ignored?
> >>
> >> It is running the regular SOR on the CPU and needs to copy up the vector
> >> and copy down the result.
> >>
> >> On Tue, Jan 10, 2023 at 1:52 PM Barry Smith <[email protected]> wrote:
> >>
> >>> We don't have colored smoothers currently in PETSc.
> >>>
> >>> > On Jan 10, 2023, at 12:56 PM, Jed Brown <[email protected]> wrote:
> >>> >
> >>> > Is DILU a point-block method? We have -pc_type pbjacobi (and
> >>> > vpbjacobi if the node size is not uniform). They are good choices
> >>> > for scale-resolving CFD on GPUs.
> >>> >
> >>> > Mark Lohry <[email protected]> writes:
> >>> >
> >>> >> I'm running GAMG with CUDA, and I'm wondering how the nominally
> >>> >> serial smoother algorithms are implemented on GPU? Specifically
> >>> >> SOR/GS and ILU(0) -- in e.g. AMGx these are applied by first
> >>> >> creating a coloring, and the smoother passes are done color by
> >>> >> color. Is this how it's done in petsc AMG?
> >>> >>
> >>> >> Tangential, AMGx and OpenFOAM offer something called "DILU",
> >>> >> diagonal ILU. Is there an equivalent in petsc?
> >>> >>
> >>> >> Thanks,
> >>> >> Mark
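For reference, the fully-on-device combination asked about above might be spelled like the following. The executable name and process count are placeholders, and -mat_type aijcusparse / -vec_type cuda assume the application takes its Mat/Vec types from the options database; richardson + pbjacobi avoids the triangular solves and host-side SOR fallbacks discussed in the thread.

```shell
# Hypothetical invocation: select GPU matrix/vector types so GAMG's
# smoother applications stay on device.
mpiexec -n 4 ./app \
  -mat_type aijcusparse -vec_type cuda \
  -pc_type gamg \
  -mg_levels_ksp_type richardson \
  -mg_levels_pc_type pbjacobi
```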
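To make the bandwidth point concrete, here is a minimal illustrative sketch of a multicolor Gauss-Seidel sweep (illustrative only -- per the thread, PETSc has no colored smoothers). Each color is a separate pass over the matrix rows, so a k-coloring costs roughly k times the memory traffic of a single operator application, which is why many small colors on unstructured grids can be slower than plain Jacobi.

```python
# Illustrative sketch (not PETSc code): multicolor Gauss-Seidel on a tiny
# dense-stored matrix. Rows sharing a color have no couplings among
# themselves, so each color could be updated in parallel on a GPU -- but
# every color requires its own pass over the matrix.

def multicolor_gs_sweep(A, b, x, colors):
    """One Gauss-Seidel sweep, processed color by color.

    colors: list of lists of row indices; rows within one list are
    mutually independent and could be updated concurrently.
    """
    n = len(b)
    for color in colors:          # one matrix pass per color
        for i in color:           # independent within a color
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            x[i] = (b[i] - s) / A[i][i]
    return x

# 1D Laplacian on 4 points: a red-black (2-color) ordering suffices.
A = [[ 2, -1,  0,  0],
     [-1,  2, -1,  0],
     [ 0, -1,  2, -1],
     [ 0,  0, -1,  2]]
b = [1.0, 0.0, 0.0, 1.0]
x = [0.0] * 4
for _ in range(200):
    multicolor_gs_sweep(A, b, x, colors=[[0, 2], [1, 3]])
print([round(v, 6) for v in x])  # -> [1.0, 1.0, 1.0, 1.0], the exact solution
```

A structured grid like this needs only two colors; an unstructured FEM/FVM graph can need many more, multiplying the per-sweep cost accordingly.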
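On the DILU question: in the AMGx/OpenFOAM sense, DILU keeps the off-diagonal parts L and U of A unchanged and recomputes only a diagonal E, giving M = (E + L) E^{-1} (E + U); storage cost is a single extra vector. Whether PETSc has an equivalent under another name is exactly the open question in the thread -- the sketch below just assumes that standard definition.

```python
# Sketch of DILU ("diagonal ILU"), assuming M = (E + L) E^{-1} (E + U)
# with L, U the strict lower/upper triangles of A and E chosen so that
# the recurrence below holds row by row.

def dilu_factor(A):
    """Modified diagonal: e_i = a_ii - sum_{j<i} a_ij * a_ji / e_j."""
    n = len(A)
    e = [0.0] * n
    for i in range(n):
        e[i] = A[i][i] - sum(A[i][j] * A[j][i] / e[j] for j in range(i))
    return e

def dilu_apply(A, e, r):
    """z = M^{-1} r via a forward and a backward triangular sweep."""
    n = len(A)
    y = [0.0] * n
    for i in range(n):                   # forward solve: (E + L) y = r
        y[i] = (r[i] - sum(A[i][j] * y[j] for j in range(i))) / e[i]
    z = [0.0] * n
    for i in reversed(range(n)):         # backward solve: (E + U) z = E y
        z[i] = y[i] - sum(A[i][j] * z[j] for j in range(i + 1, n)) / e[i]
    return z

A = [[ 4.0, -1.0,  0.0],
     [-1.0,  4.0, -1.0],
     [ 0.0, -1.0,  4.0]]
e = dilu_factor(A)
z = dilu_apply(A, e, [1.0, 2.0, 3.0])
print(e, z)
```

Note the two sweeps are exactly the sequential triangular solves the thread complains about, so DILU saves setup and storage relative to ILU(0) but shares its GPU-unfriendliness.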
