On Wed, Nov 12, 2025 at 1:31 AM Grant Chao <[email protected]> wrote:
>
> Thank you for the suggestion.
>
> We have already tried running multiple CPU ranks with a single GPU.
> However, we observed that as the number of ranks increases, the EPS solver
> becomes significantly slower. We are not sure of the exact cause: could it
> be due to process access contention, hidden data transfers, or perhaps
> another reason? We would be very interested to hear your insight on this
> matter.
>
Have you started MPS? See
https://docs.nvidia.com/deploy/mps/index.html#starting-and-stopping-mps-on-linux

> To avoid this problem, we used the gpu_comm approach mentioned before.
> During testing, we noticed that the mapping between rank ID and GPU ID
> seems to be set automatically and is not user-specifiable.
>
> For example, with 4 GPUs (0-3) and 8 CPU ranks (0-7), the program binds
> ranks 0 and 4 to GPU 0, ranks 1 and 5 to GPU 1, and so on.
>
Yes, that is the current round-robin algorithm. Do you want ranks 0, 1 on
GPU 0, ranks 2, 3 on GPU 1, and so on? Sketches of both a custom device
selection and the gpu_comm assembly are at the bottom of this message.

> We tested possible solutions, such as calling cudaSetDevice() manually to
> set rank 4 to device 1, but it did not work as expected. Ranks 0 and 4
> still used GPU 0.
>
> We would appreciate your guidance on how to customize this mapping. Thank
> you for your support.
>
> Best wishes,
> Grant
>
>
> At 2025-11-12 11:48:47, "Junchao Zhang" <[email protected]> said:
>
> Hi, Wenbo,
>   I think your approach should work. But before going this extra step
> with gpu_comm, have you tried to map multiple MPI ranks (CPUs) to one GPU
> using NVIDIA's Multi-Process Service (MPS)? If MPS works well, then you
> can avoid the extra complexity.
>
> --Junchao Zhang
>
>
> On Tue, Nov 11, 2025 at 7:50 PM Wenbo Zhao <[email protected]> wrote:
>
>> Dear all,
>>
>> We are trying to solve linear systems with KSP using GPUs.
>> We found the example src/ksp/ksp/tutorials/bench_kspsolve.c, in which
>> the matrix is created and assembled using the COO interface provided by
>> PETSc. In that example, the number of CPU ranks is the same as the number
>> of GPUs.
>> In our case, the matrix coefficients are computed on the CPUs, and this
>> is expensive; it can take half of the total time or even more.
>>
>> We want to use more CPUs to compute the coefficients in parallel. A
>> smaller communicator (say gpu_comm) is created for the CPU ranks that
>> correspond to the GPUs. The coefficients are computed by all of the ranks
>> in MPI_COMM_WORLD and then sent to the gpu_comm ranks via MPI. The matrix
>> (of type aijcusparse) is then created and assembled within gpu_comm.
>> Finally, KSPSolve is performed on the GPUs.
>>
>> I'm not sure if this approach will work in practice. Are there any
>> comparable examples I can look to for guidance?
>>
>> Best,
>> Wenbo
>>
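
On the custom mapping question, here is a minimal sketch (not tested against
your setup) of one common MPI+CUDA pattern: pick the device from the
node-local rank before PETSc's device layer comes up, so that ranks 0 and 1
land on GPU 0, ranks 2 and 3 on GPU 1, and so on. The node-local split and
the blocked mapping are assumptions for illustration; whether PETSc keeps a
device chosen this way, or whether an option such as -device_select in your
PETSc version is the better route, is something to verify, for example by
watching nvidia-smi while the job runs.

#include <mpi.h>
#include <cuda_runtime.h>
#include <petscsys.h>

int main(int argc, char **argv)
{
  int      local_rank, local_size, ndev;
  MPI_Comm node_comm;

  MPI_Init(&argc, &argv);

  /* Ranks sharing a node, so the device index is computed per node */
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL, &node_comm);
  MPI_Comm_rank(node_comm, &local_rank);
  MPI_Comm_size(node_comm, &local_size);

  /* Blocked mapping: with 8 node-local ranks and 4 GPUs this gives
     ranks 0,1 -> GPU 0, ranks 2,3 -> GPU 1, ... (instead of round-robin) */
  cudaGetDeviceCount(&ndev);
  if (ndev > 0) {
    int ranks_per_gpu = (local_size + ndev - 1) / ndev;
    cudaSetDevice(local_rank / ranks_per_gpu);
  }

  /* Initialize PETSc only after the device has been chosen.  Whether PETSc
     keeps this selection depends on how its device layer initializes, so
     confirm the actual binding at run time (e.g. with nvidia-smi). */
  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
  /* ... usual PETSc setup, KSP/EPS solves, etc. ... */
  PetscCall(PetscFinalize());

  MPI_Comm_free(&node_comm);
  MPI_Finalize();
  return 0;
}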

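For Wenbo's gpu_comm outline, a minimal structural sketch follows, under
assumptions that are not from the thread: the first half of the world ranks
are taken to own GPUs, a toy 100x100 diagonal matrix stands in for the real
coefficients, and the MPI transfer of coefficients from the non-GPU ranks is
only indicated by a comment. It needs a PETSc build with CUDA and shows only
the communicator split, the COO assembly of an aijcusparse matrix on
gpu_comm, and the KSPSolve on that sub-communicator.

#include <petscksp.h>

int main(int argc, char **argv)
{
  PetscMPIInt rank, size;
  MPI_Comm    gpu_comm;

  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
  MPI_Comm_rank(PETSC_COMM_WORLD, &rank);
  MPI_Comm_size(PETSC_COMM_WORLD, &size);

  /* Assumption: the first half of the ranks drive GPUs; the rest only
     compute coefficients.  Non-members get MPI_COMM_NULL back. */
  PetscMPIInt ngpu = size / 2 > 0 ? size / 2 : 1;
  const int   on_gpu_comm = (rank < ngpu);
  MPI_Comm_split(PETSC_COMM_WORLD, on_gpu_comm ? 0 : MPI_UNDEFINED, rank, &gpu_comm);

  /* All ranks in MPI_COMM_WORLD would compute their share of COO triplets
     here (the expensive part) and send them to their target gpu_comm rank
     with plain MPI (e.g. MPI_Gatherv over a per-GPU sub-communicator).
     That transfer is omitted; only the assembly/solve skeleton is shown. */

  if (on_gpu_comm) {
    PetscMPIInt grank, gsize;
    MPI_Comm_rank(gpu_comm, &grank);
    MPI_Comm_size(gpu_comm, &gsize);

    /* Toy COO data: each gpu_comm rank contributes a slice of a 100x100
       diagonal, standing in for the coefficients received above. */
    PetscInt     N = 100;
    PetscInt     rstart = (N * grank) / gsize, rend = (N * (grank + 1)) / gsize;
    PetscCount   ncoo = (PetscCount)(rend - rstart);
    PetscInt    *coo_i, *coo_j;
    PetscScalar *coo_v;
    PetscCall(PetscMalloc3(ncoo, &coo_i, ncoo, &coo_j, ncoo, &coo_v));
    for (PetscCount k = 0; k < ncoo; k++) {
      coo_i[k] = rstart + (PetscInt)k;
      coo_j[k] = rstart + (PetscInt)k;
      coo_v[k] = 2.0;
    }

    /* Assemble an aijcusparse matrix on gpu_comm via the COO interface */
    Mat A;
    PetscCall(MatCreate(gpu_comm, &A));
    PetscCall(MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, N, N));
    PetscCall(MatSetType(A, MATAIJCUSPARSE));
    PetscCall(MatSetPreallocationCOO(A, ncoo, coo_i, coo_j));
    PetscCall(MatSetValuesCOO(A, coo_v, INSERT_VALUES));
    PetscCall(PetscFree3(coo_i, coo_j, coo_v));

    /* Solve on the sub-communicator; the solver work runs on the GPUs */
    Vec x, b;
    KSP ksp;
    PetscCall(MatCreateVecs(A, &x, &b));
    PetscCall(VecSet(b, 1.0));
    PetscCall(KSPCreate(gpu_comm, &ksp));
    PetscCall(KSPSetOperators(ksp, A, A));
    PetscCall(KSPSetFromOptions(ksp));
    PetscCall(KSPSolve(ksp, b, x));

    PetscCall(KSPDestroy(&ksp));
    PetscCall(VecDestroy(&x));
    PetscCall(VecDestroy(&b));
    PetscCall(MatDestroy(&A));
    MPI_Comm_free(&gpu_comm);
  }

  PetscCall(PetscFinalize());
  return 0;
}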