I ran your code successfully with and without GPU-aware MPI. I see a bit of time in MatSetValue -- you can make it a bit faster using one MatSetValues call per row (a sketch is below), but it's typical that assembling a matrix like this (sequentially on the host) will be more expensive than a few unpreconditioned CG iterations (which don't come close to solving the problem -- use multigrid if you want to actually solve this problem).
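In case it helps, here is a rough sketch of what I mean by one MatSetValues call per row. The variable names (A, nx, ny), the row-to-(i, j) mapping, and the -1/4 stencil values are placeholders rather than copied from your main.cpp, so adjust them to match your loop:

```
/* Hypothetical per-row assembly for an nx-by-ny 5-point stencil; A, nx, ny,
 * and the -1/4 stencil values are placeholders, not taken from main.cpp.
 * PetscCall() is the PETSc >= 3.18 spelling; use CHKERRQ(ierr) on older versions. */
PetscInt rowStart, rowEnd;
PetscCall(MatGetOwnershipRange(A, &rowStart, &rowEnd));
for (PetscInt row = rowStart; row < rowEnd; row++) {
  PetscInt    i = row / ny, j = row % ny; /* assumed row-major grid ordering */
  PetscInt    cols[5];
  PetscScalar vals[5];
  PetscInt    n = 0;
  if (i > 0)      { cols[n] = row - ny; vals[n] = -1.0; n++; }
  if (j > 0)      { cols[n] = row - 1;  vals[n] = -1.0; n++; }
  cols[n] = row; vals[n] = 4.0; n++;
  if (j < ny - 1) { cols[n] = row + 1;  vals[n] = -1.0; n++; }
  if (i < nx - 1) { cols[n] = row + ny; vals[n] = -1.0; n++; }
  /* one call per row instead of up to five MatSetValue calls */
  PetscCall(MatSetValues(A, 1, &row, n, cols, vals, INSERT_VALUES));
}
PetscCall(MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY));
PetscCall(MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY));
```

For the multigrid comment: on a 5-point Laplacian like this, you could start by trying -pc_type gamg in place of -pc_type none.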
Rohan Yadav <roh...@alumni.cmu.edu> writes:

> Hi,
>
> I'm developing a microbenchmark that runs a CG solve using PETSc on a mesh
> using a 5-point stencil matrix. My code (linked here:
> https://github.com/rohany/petsc-pde-benchmark/blob/main/main.cpp, only 120
> lines) works on 1 GPU and has great performance. When I move to 2 GPUs, the
> program appears to get stuck in the input generation. I've littered the
> code with print statements and have found out the following clues:
>
> * The first rank progresses through this loop:
> https://github.com/rohany/petsc-pde-benchmark/blob/main/main.cpp#L44, but
> then does not exit (it seems to get stuck right before rowStart == rowEnd)
> * The second rank makes very few iterations through the loop for its
> allotted rows.
>
> Therefore, neither rank makes it to the call to MatAssemblyBegin.
>
> I'm running the code using the following command line on the Summit
> supercomputer:
> ```
> jsrun -n 2 -g 1 -c 1 -b rs -r 2
> /gpfs/alpine/scratch/rohany/csc335/petsc-pde-benchmark/main -ksp_max_it 200
> -ksp_type cg -pc_type none -ksp_atol 1e-10 -ksp_rtol 1e-10 -vec_type cuda
> -mat_type aijcusparse -use_gpu_aware_mpi 0 -nx 8485 -ny 8485
> ```
>
> Any suggestions will be appreciated! I feel that I have applied many of the
> common petsc optimizations of preallocating my matrix row counts, so I'm
> not sure what's going on with this input generation.
>
> Thanks,
>
> Rohan Yadav