I ran your code successfully with and without GPU-aware MPI. I see some time 
spent in MatSetValue -- you can make assembly a bit faster by using one 
MatSetValues call per row (see the sketch below), but it's typical that 
assembling a matrix like this (sequentially on the host) will be more 
expensive than some unpreconditioned CG iterations (which don't come close to 
solving the problem -- use multigrid if you want to actually solve it).
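
For reference, here is a minimal sketch of one-MatSetValues-call-per-row
assembly for a 5-point stencil. The grid dimensions, stencil values, and
option handling are illustrative assumptions, not taken from your code:

```
#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat      A;
  PetscInt nx = 8485, ny = 8485, rowStart, rowEnd;

  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
  PetscCall(PetscOptionsGetInt(NULL, NULL, "-nx", &nx, NULL));
  PetscCall(PetscOptionsGetInt(NULL, NULL, "-ny", &ny, NULL));

  PetscCall(MatCreate(PETSC_COMM_WORLD, &A));
  PetscCall(MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, nx * ny, nx * ny));
  PetscCall(MatSetFromOptions(A)); /* picks up -mat_type aijcusparse */
  PetscCall(MatSeqAIJSetPreallocation(A, 5, NULL));
  PetscCall(MatMPIAIJSetPreallocation(A, 5, NULL, 2, NULL)); /* conservative off-process estimate */
  PetscCall(MatGetOwnershipRange(A, &rowStart, &rowEnd));

  for (PetscInt row = rowStart; row < rowEnd; row++) {
    PetscInt    i = row / ny, j = row % ny; /* natural ordering on an nx-by-ny grid */
    PetscInt    cols[5], n = 0;
    PetscScalar vals[5];

    if (i > 0)      { cols[n] = row - ny; vals[n++] = -1.0; }
    if (j > 0)      { cols[n] = row - 1;  vals[n++] = -1.0; }
    cols[n] = row;    vals[n++] = 4.0;
    if (j < ny - 1) { cols[n] = row + 1;  vals[n++] = -1.0; }
    if (i < nx - 1) { cols[n] = row + ny; vals[n++] = -1.0; }

    /* one call per row instead of one MatSetValue call per nonzero */
    PetscCall(MatSetValues(A, 1, &row, n, cols, vals, INSERT_VALUES));
  }
  PetscCall(MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY));
  PetscCall(MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY));

  PetscCall(MatDestroy(&A));
  PetscCall(PetscFinalize());
  return 0;
}
```

When you get to actually solving the problem, swapping -pc_type none for
something like -pc_type gamg is the usual starting point for a Poisson-type
operator.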

Rohan Yadav <roh...@alumni.cmu.edu> writes:

> Hi,
>
> I'm developing a microbenchmark that runs a CG solve using PETSc on a mesh
> using a 5-point stencil matrix. My code (linked here:
> https://github.com/rohany/petsc-pde-benchmark/blob/main/main.cpp, only 120
> lines) works on 1 GPU and has great performance. When I move to 2 GPUs, the
> program appears to get stuck in the input generation. I've littered the
> code with print statements and found the following clues:
>
> * The first rank progresses through this loop:
> https://github.com/rohany/petsc-pde-benchmark/blob/main/main.cpp#L44, but
> then does not exit (it seems to get stuck just before rowStart reaches rowEnd)
> * The second rank makes very few iterations through the loop for its
> allotted rows.
>
> Therefore, neither rank makes it to the call to MatAssemblyBegin.
>
> I'm running the code using the following command line on the Summit
> supercomputer:
> ```
> jsrun -n 2 -g 1 -c 1 -b rs -r 2
> /gpfs/alpine/scratch/rohany/csc335/petsc-pde-benchmark/main -ksp_max_it 200
> -ksp_type cg -pc_type none -ksp_atol 1e-10 -ksp_rtol 1e-10 -vec_type cuda
> -mat_type aijcusparse -use_gpu_aware_mpi 0 -nx 8485 -ny 8485
> ```
>
> Any suggestions would be appreciated! I feel that I have applied the common
> PETSc optimizations, such as preallocating my matrix row counts, so I'm not
> sure what's going on with this input generation.
>
> Thanks,
>
> Rohan Yadav
