Thank you for the suggestion.


We have already tried running multiple CPU ranks with a single GPU. However, we 
observed that as the number of ranks increases, the EPS solver becomes 
significantly slower. We are not sure of the exact cause. Could it be due to 
contention between processes accessing the GPU, hidden data transfers, or 
perhaps another reason? We would be very interested to hear your insight on 
this matter.


To avoid this problem, we used the gpu_comm approach mentioned before. During 
testing, we noticed that the mapping between rank ID and GPU ID seems to be set 
automatically and is not user-specifiable.


For example, with 4 GPUs (0-3) and 8 CPU ranks (0-7), the program binds ranks 0 
and 4 to GPU 0, ranks 1 and 5 to GPU 1, and so on.
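
The binding described above is consistent with a round-robin rule, GPU = rank 
modulo the number of GPUs. The short C sketch below shows that rule next to a 
contiguous (block) alternative. It only illustrates the two mappings under that 
assumption and does not claim to reproduce how the library selects devices 
internally; both function names are hypothetical.

```c
#include <assert.h>

/* Round-robin mapping: consecutive ranks go to consecutive GPUs.
 * With 8 ranks and 4 GPUs, ranks 0 and 4 share GPU 0, ranks 1 and 5
 * share GPU 1, and so on. (Hypothetical helper, not a PETSc function.) */
int gpu_round_robin(int rank, int num_gpus)
{
  assert(rank >= 0 && num_gpus > 0);
  return rank % num_gpus;
}

/* Block mapping: neighbouring ranks share a GPU.
 * With 8 ranks and 4 GPUs, ranks 0 and 1 share GPU 0, ranks 2 and 3
 * share GPU 1, and so on. Assumes num_ranks is a multiple of num_gpus. */
int gpu_block(int rank, int num_ranks, int num_gpus)
{
  assert(rank >= 0 && rank < num_ranks);
  assert(num_gpus > 0 && num_ranks % num_gpus == 0);
  return rank / (num_ranks / num_gpus);
}
```
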


We tested possible solutions, such as calling cudaSetDevice() manually to set 
rank 4 to device 1, but it did not work as expected. Ranks 0 and 4 still used 
GPU 0.
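
One general way to control the binding from outside the library is to restrict 
each process to a single device through the CUDA_VISIBLE_DEVICES environment 
variable, set before the CUDA runtime is initialized in that process. The C 
sketch below only builds the value for a given node-local rank. The helper name 
and the block assignment are assumptions for illustration, and we have not 
verified that this resolves the behavior described above.

```c
#include <assert.h>
#include <stdio.h>

/* Write the device index for this process into buf as a decimal string,
 * suitable as a value for CUDA_VISIBLE_DEVICES.
 * local_rank:    rank of the process within its node (assumed known,
 *                e.g. from an environment variable set by the launcher)
 * ranks_per_gpu: how many consecutive local ranks share one GPU
 * Returns the number of characters written, or -1 if buf is too small.
 * (Hypothetical helper, not part of CUDA, MPI, or PETSc.) */
int visible_device_string(char *buf, size_t len, int local_rank, int ranks_per_gpu)
{
  assert(buf != NULL && local_rank >= 0 && ranks_per_gpu > 0);
  int n = snprintf(buf, len, "%d", local_rank / ranks_per_gpu);
  return (n < 0 || (size_t)n >= len) ? -1 : n;
}
```

The resulting string would be exported as CUDA_VISIBLE_DEVICES, either with 
setenv() early in main() or from a wrapper script around the executable. A 
process restricted this way sees exactly one device, which it addresses as 
device 0.
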


We would appreciate your guidance on how to customize this mapping. Thank you 
for your support.


Best wishes,
Grant


At 2025-11-12 11:48:47, "Junchao Zhang" <[email protected]>, said:

Hi, Wenbo,
   I think your approach should work. But before going this extra step with 
gpu_comm, have you tried mapping multiple MPI ranks (CPUs) to one GPU, using 
NVIDIA's Multi-Process Service (MPS)? If MPS works well, then you can 
avoid the extra complexity.


--Junchao Zhang




On Tue, Nov 11, 2025 at 7:50 PM Wenbo Zhao <[email protected]> wrote:

Dear all,


We are trying to solve linear systems with KSP on GPUs.
We found the example src/ksp/ksp/tutorials/bench_kspsolve.c, in which the 
matrix is created and assembled using the COO interface provided by PETSc. In 
this example, the number of CPU ranks is the same as the number of GPUs.
In our case, the matrix parameters are computed on CPUs. This computation is 
expensive and can take half of the total time or even more.


We want to use more CPUs to compute the parameters in parallel. To do this, we 
create a smaller communicator (such as gpu_comm) containing only the CPU ranks 
that correspond to the GPUs. The parameters are computed by all of the CPUs (in 
MPI_COMM_WORLD) and then sent via MPI to the ranks in gpu_comm. The matrix (of 
type aijcusparse) is then created and assembled within gpu_comm. Finally, 
KSPSolve is performed on the GPUs.
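
The scheme described above comes down to two small rules: which world ranks 
belong to gpu_comm, and which gpu_comm member receives the parameters computed 
by a given world rank. The C helpers below assume, purely for illustration, a 
single node where the first num_gpus world ranks form gpu_comm and the other 
ranks send to them round-robin. The names are hypothetical; in an actual code 
the color would be passed to MPI_Comm_split, with MPI_UNDEFINED for 
non-members.

```c
#include <assert.h>

/* Color for splitting MPI_COMM_WORLD: 0 for ranks that drive a GPU,
 * -1 for CPU-only ranks (a stand-in for MPI_UNDEFINED in a real call
 * to MPI_Comm_split). Assumes the first num_gpus ranks own the GPUs. */
int gpu_comm_color(int world_rank, int num_gpus)
{
  assert(world_rank >= 0 && num_gpus > 0);
  return world_rank < num_gpus ? 0 : -1;
}

/* World rank of the gpu_comm member that receives the parameters
 * computed by world_rank (round-robin assignment). */
int owner_rank(int world_rank, int num_gpus)
{
  assert(world_rank >= 0 && num_gpus > 0);
  return world_rank % num_gpus;
}
```
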


I’m not sure if this approach will work in practice. Are there any comparable 
examples I can look to for guidance?


Best,
Wenbo
