The cuSPARSE matrix triple product takes a lot of memory. We usually use Kokkos, configured with the cuSPARSE TPL turned off.
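For reference, a minimal sketch of that configuration (the --download options are the standard ones; I am quoting the TPL switch from memory, so please verify the exact name with ./configure --help on your PETSc version):

    # build PETSc with Kokkos / Kokkos Kernels, with the cuSPARSE TPL inside Kokkos Kernels disabled
    ./configure --with-cuda=1 --download-kokkos --download-kokkos-kernels \
        --with-kokkos-kernels-tpl=0   # switch name from memory; check ./configure --help

    # then at run time use the Kokkos types instead of aijcusparse, e.g.
    -mat_type aijkokkos -vec_type kokkos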
If you have a complex problem, different parts of the domain can coarsen at different rates. Jacobi instead of asm will save a fair amount of memory. If you run with -ksp_view you will see the operator (matrix) complexity reported by GAMG; it should be < 1.5. (A sketch of the suggested run options is appended below the quoted messages.)

Mark

On Wed, Jan 18, 2023 at 3:42 PM Mark Lohry <[email protected]> wrote:

> With asm I see a range of 8GB-13GB, a slightly smaller ratio, but that
> probably explains it (does this still seem like a lot of memory to you for
> the problem size?)
>
> In general I don't have the same number of blocks per row, so I suppose it
> makes sense that there's some memory imbalance.
>
> On Wed, Jan 18, 2023 at 3:35 PM Mark Adams <[email protected]> wrote:
>
>> Can your problem have load imbalance?
>>
>> You might try '-pc_type asm' (and/or jacobi) to see your baseline load
>> imbalance. GAMG can add some load imbalance, but start by getting a
>> baseline.
>>
>> Mark
>>
>> On Wed, Jan 18, 2023 at 2:54 PM Mark Lohry <[email protected]> wrote:
>>
>>> Q0) Does -memory_view trace GPU memory as well, or is there another
>>> method to query the peak device memory allocation?
>>>
>>> Q1) I'm loading an aijcusparse matrix with MatLoad and running with
>>> -ksp_type fgmres -pc_type gamg -mg_levels_pc_type asm. The matrix has
>>> 27,142,948 rows and cols, bs=4, and 759,709,392 total nonzeros. Using
>>> 8 ranks on 8x 80GB GPUs, during the setup phase, before crashing with
>>> CUSPARSE_STATUS_INSUFFICIENT_RESOURCES, nvidia-smi shows the output
>>> pasted below.
>>>
>>> GPU memory usage spans 36GB-50GB, but one rank is at 77GB. Is this
>>> expected? Do I need to manually repartition this somehow?
>>>
>>> Thanks,
>>> Mark
>>>
>>> +-----------------------------------------------------------------------------+
>>> | Processes:                                                                  |
>>> |  GPU   GI   CI        PID   Type   Process name                 GPU Memory  |
>>> |        ID   ID                                                  Usage       |
>>> |=============================================================================|
>>> |    0   N/A  N/A    1630309      C   nvidia-cuda-mps-server           27MiB  |
>>> |    0   N/A  N/A    1696543      C   ./petsc_solver_test           38407MiB  |
>>> |    0   N/A  N/A    1696544      C   ./petsc_solver_test             467MiB  |
>>> |    0   N/A  N/A    1696545      C   ./petsc_solver_test             467MiB  |
>>> |    0   N/A  N/A    1696546      C   ./petsc_solver_test             467MiB  |
>>> |    0   N/A  N/A    1696548      C   ./petsc_solver_test             467MiB  |
>>> |    0   N/A  N/A    1696550      C   ./petsc_solver_test             471MiB  |
>>> |    0   N/A  N/A    1696551      C   ./petsc_solver_test             467MiB  |
>>> |    0   N/A  N/A    1696552      C   ./petsc_solver_test             467MiB  |
>>> |    1   N/A  N/A    1630309      C   nvidia-cuda-mps-server           27MiB  |
>>> |    1   N/A  N/A    1696544      C   ./petsc_solver_test           35849MiB  |
>>> |    2   N/A  N/A    1630309      C   nvidia-cuda-mps-server           27MiB  |
>>> |    2   N/A  N/A    1696545      C   ./petsc_solver_test           36719MiB  |
>>> |    3   N/A  N/A    1630309      C   nvidia-cuda-mps-server           27MiB  |
>>> |    3   N/A  N/A    1696546      C   ./petsc_solver_test           37343MiB  |
>>> |    4   N/A  N/A    1630309      C   nvidia-cuda-mps-server           27MiB  |
>>> |    4   N/A  N/A    1696548      C   ./petsc_solver_test           36935MiB  |
>>> |    5   N/A  N/A    1630309      C   nvidia-cuda-mps-server           27MiB  |
>>> |    5   N/A  N/A    1696550      C   ./petsc_solver_test           49953MiB  |
>>> |    6   N/A  N/A    1630309      C   nvidia-cuda-mps-server           27MiB  |
>>> |    6   N/A  N/A    1696551      C   ./petsc_solver_test           47693MiB  |
>>> |    7   N/A  N/A    1630309      C   nvidia-cuda-mps-server           27MiB  |
>>> |    7   N/A  N/A    1696552      C   ./petsc_solver_test           77331MiB  |
>>> +-----------------------------------------------------------------------------+
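PS: a minimal sketch of the runs I would try, pulling together the options mentioned above (the combination is illustrative; ./petsc_solver_test and the rank count are taken from your message):

    # baseline: plain Jacobi (no GAMG) to see per-rank memory use and load imbalance
    mpiexec -n 8 ./petsc_solver_test -ksp_type fgmres -pc_type jacobi -memory_view

    # GAMG with cheaper Jacobi smoothing; check the complexity reported in -ksp_view
    mpiexec -n 8 ./petsc_solver_test -ksp_type fgmres -pc_type gamg \
        -mg_levels_pc_type jacobi -ksp_view -memory_view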
