> 1) It uses MatMPIDenseScatter() to move to the other ranks the rows of the C matrix that they need. That function contains a call to MatDenseGetArrayRead(), which would normally trigger a copy of C up to the CPU each time. But since C is not changing in your test run, I guess it only triggers one copy.
>
> 2) It uses MatMatMultNumericAdd_SeqAIJ_SeqDense(aij->B,workB,cdense->A,PETSC_TRUE);CHKERRQ(ierr); to do the off-diagonal part of the product, but this triggers, for each multiply, a copy of the result matrix from the CPU to the GPU (hugely expensive).
>
> For performance there needs to be a new routine, MatMatMultNumeric_MPIAIJCUSPARSE_MPICUDADense(), that is smarter about the needed MPI communication, so it only moves exactly what it needs to the other ranks, and that does the off-diagonal part of the product on the GPU so it does not need to copy the result up to the CPU.

MPIAIJCUSPARSE uses MatProductSetFromOptions_MPIAIJBACKEND.
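To make the flow above concrete, here is a minimal, self-contained sketch (not code from the thread) of how such a product is driven from user code through the public MatProduct interface; MatProductSetFromOptions() is the point where a type-specific setup routine such as MatProductSetFromOptions_MPIAIJBACKEND gets selected. The sizes, values, and the comment describing the per-rank split of the numeric phase are illustrative assumptions, not a description of the actual internals:

```c
#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat            A, B, C;
  PetscInt       i, rstart, rend, n = 8, k = 4;   /* tiny illustrative sizes */
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;

  /* Sparse operand; run with -mat_type aijcusparse to get MPIAIJCUSPARSE */
  ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
  ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);CHKERRQ(ierr);
  ierr = MatSetType(A, MATAIJ);CHKERRQ(ierr);
  ierr = MatSetFromOptions(A);CHKERRQ(ierr);
  ierr = MatSetUp(A);CHKERRQ(ierr);
  ierr = MatGetOwnershipRange(A, &rstart, &rend);CHKERRQ(ierr);
  for (i = rstart; i < rend; i++) {
    ierr = MatSetValue(A, i, i, 2.0, INSERT_VALUES);CHKERRQ(ierr);
    if (i + 1 < n) { ierr = MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES);CHKERRQ(ierr); }
  }
  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);

  /* Dense operand (zero entries are fine for illustration) */
  ierr = MatCreateDense(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, n, k, NULL, &B);CHKERRQ(ierr);
  ierr = MatAssemblyBegin(B, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(B, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);

  /* Explicit product interface; MatMatMult() wraps these same steps.
     MatProductSetFromOptions() is where the implementation for the given
     matrix types (e.g. MatProductSetFromOptions_MPIAIJBACKEND) is chosen.
     Conceptually, each rank's numeric phase computes
       C_local = A_diag * B_local + A_offdiag * B_gathered
     where B_gathered holds the remote rows of the dense operand that the
     off-diagonal block touches (the communication Barry describes above). */
  ierr = MatProductCreate(A, B, NULL, &C);CHKERRQ(ierr);
  ierr = MatProductSetType(C, MATPRODUCT_AB);CHKERRQ(ierr);
  ierr = MatProductSetFromOptions(C);CHKERRQ(ierr);
  ierr = MatProductSymbolic(C);CHKERRQ(ierr);
  ierr = MatProductNumeric(C);CHKERRQ(ierr);   /* call repeatedly to re-multiply */

  ierr = MatDestroy(&C);CHKERRQ(ierr);
  ierr = MatDestroy(&B);CHKERRQ(ierr);
  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}
```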
Rohan, I would suggest adding a PetscLogStage around your performance loop (do a warmup outside of it; see the sketch at the end of this message) and sending the relevant portion of the log.

> Barry
>
> ---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
>
> /g/g15/yadav2/taco/petsc/bin/benchmark on a named lassen457 with 2 processors, by yadav2 Wed Feb  2 17:23:19 2022
> Using Petsc Release Version 3.16.3, unknown
>
>                          Max       Max/Min     Avg        Total
> Time (sec):           1.163e+02     1.000   1.163e+02
> Objects:              4.800e+01     1.000   4.800e+01
> Flop:                 6.338e+11     1.065   6.144e+11  1.229e+12
> Flop/sec:             5.451e+09     1.065   5.284e+09  1.057e+10
> MPI Messages:         3.500e+01     1.000   3.500e+01  7.000e+01
> MPI Message Lengths:  2.544e+09     1.000   7.267e+07  5.087e+09
> MPI Reductions:       8.100e+01     1.000
>
> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>                             e.g., VecAXPY() for real vectors of length N --> 2N flop
>                             and VecAXPY() for complex vectors of length N --> 8N flop
>
> Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
>                         Avg     %Total     Avg     %Total    Count   %Total     Avg        %Total    Count   %Total
>  0:      Main Stage: 1.1628e+02 100.0%  1.2288e+12 100.0%  7.000e+01 100.0%  7.267e+07     100.0%  6.300e+01  77.8%
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on interpreting output.
> Phase summary info:
>    Count: number of times phase was executed
>    Time and Flop: Max - maximum over all processors
>                   Ratio - ratio of maximum to minimum over all processors
>    Mess: number of messages sent
>    AvgLen: average message length (bytes)
>    Reduct: number of global reductions
>    Global: entire computation
>    Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>           %T - percent time in this phase         %F - percent flop in this phase
>           %M - percent messages in this phase     %L - percent message lengths in this phase
>           %R - percent reductions in this phase
>    Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
>    GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU time over all processors)
>    CpuToGpu Count: total number of CPU to GPU copies per processor
>    CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per processor)
>    GpuToCpu Count: total number of GPU to CPU copies per processor
>    GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per processor)
>    GPU %F: percent flops on GPU in this event
> ------------------------------------------------------------------------------------------------------------------------
> Event              Count      Time (sec)        Flop                            --- Global ---  --- Stage ----   Total    GPU   - CpuToGpu -   - GpuToCpu -  GPU
>                  Max Ratio  Max       Ratio     Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count  Size  Count  Size   %F
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> BuildTwoSided         2 1.0 4.4400e-01567.5   0.00e+00 0.0 2.0e+00 4.0e+00 2.0e+00  0  0  3  0  2   0  0  3  0  3      0       0     0 0.00e+00    0 0.00e+00  0
> BuildTwoSidedF        1 1.0 4.4395e-0115659.1 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00  0  0  0  0  1   0  0  0  0  2      0       0     0 0.00e+00    0 0.00e+00  0
> MatAssemblyBegin     32 1.0 4.4400e-017378.9  0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00  0  0  0  0  1   0  0  0  0  2      0       0     0 0.00e+00    0 0.00e+00  0
> MatAssemblyEnd       32 1.0 1.8511e+00 2.2    0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00  1  0  0  0  7   1  0  0  0 10      0       0     0 0.00e+00    0 0.00e+00  0
> MatZeroEntries        1 1.0 3.3306e-03 1.0    0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0       0     0 0.00e+00    0 0.00e+00  0
> MatLoad               1 1.0 1.7220e+01 1.0    0.00e+00 0.0 6.0e+00 -8.8e+07 2.1e+01 15  0  9-10 26  15  0  9-10 33     0       0     0 0.00e+00    0 0.00e+00  0
> MatMatMultSym        60 1.0 9.2215e-01 2.6    0.00e+00 0.0 4.0e+00 7.3e+05 3.2e+01  1  0  6  0 40   1  0  6  0 51      0       0     0 0.00e+00    0 0.00e+00  0
> MatMatMultNum        30 1.0 4.2967e+01 1.0    6.34e+11 1.1 6.0e+01 9.4e+07 0.0e+00 37100 86110  0  37100 86110  0  28598  920026     2 6.71e+03   30 8.73e+04 98
> MatCUSPARSCopyTo      1 1.0 4.4761e-01 1.0    0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0       0     1 3.80e+03    0 0.00e+00  0
> MatDenseCopyTo        1 1.0 2.2742e-01 1.0    0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0       0     1 2.91e+03    0 0.00e+00  0
> MatDenseCopyFrom     31 1.0 1.2006e+01 1.0    0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 10  0  0  0  0  10  0  0  0  0      0       0     0 0.00e+00   31 9.02e+04  0
> VecSet                3 1.0 4.1917e-04 1.1    0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0       0     0 0.00e+00    0 0.00e+00  0
> SFSetGraph            1 1.0 1.9180e-04 1.1    0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0       0     0 0.00e+00    0 0.00e+00  0
> SFSetUp               1 1.0 1.3672e-02 1.1    0.00e+00 0.0 4.0e+00 7.3e+05 1.0e+00  0  0  6  0  1   0  0  6  0  2      0       0     0 0.00e+00    0 0.00e+00  0
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Memory usage is given in bytes:
>
> Object Type          Creations   Destructions     Memory  Descendants' Mem.
> Reports information only for process 0.
>
> --- Event Stage 0: Main Stage
>
>               Matrix    37             30     2867511840     0.
>               Viewer     2              0              0     0.
>               Vector     4              1           1792     0.
>            Index Set     2              2        1495248     0.
>    Star Forest Graph     3              0              0     0.
> ========================================================================================================================
> Average time to get PetscTime(): 3.83e-08
> Average time for MPI_Barrier(): 7.874e-07
> Average time for zero size MPI_Send(): 3.4035e-06
> #PETSc Option Table entries:
> -bench spmm
> -enable_gpu
> -log_view
> -mat_type aijcusparse
> -matload_block_size 1
> -matrix /p/gpfs1/yadav2/tensors/petsc/arabic-2005.petsc
> -n 20
> -vec_type cuda
> -warmup 10
>
> Thanks,
>
> Rohan Yadav

--
Stefano
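For reference, a minimal sketch of the logging-stage suggestion above (not code from the thread): the warmup multiplies run outside the stage so that their one-time symbolic work and host/device copies are excluded, and only the measured loop is pushed into a named stage, which -log_view then reports separately. The helper name, loop counts, and reuse pattern are illustrative assumptions.

```c
#include <petscmat.h>

/* Hypothetical helper: benchmark C = A*B with a warmup outside the logged stage. */
PetscErrorCode BenchSpMM(Mat A, Mat B, PetscInt warmup, PetscInt iters)
{
  Mat            C = NULL;
  PetscLogStage  stage;
  PetscInt       i;
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  /* Warmup: the first call builds C, later calls reuse it; any one-time
     setup and CPU/GPU copies happen here, outside the logged stage. */
  ierr = MatMatMult(A, B, MAT_INITIAL_MATRIX, PETSC_DEFAULT, &C);CHKERRQ(ierr);
  for (i = 1; i < warmup; i++) {
    ierr = MatMatMult(A, B, MAT_REUSE_MATRIX, PETSC_DEFAULT, &C);CHKERRQ(ierr);
  }

  /* Measured loop: shows up as its own "SpMM" stage in the -log_view output. */
  ierr = PetscLogStageRegister("SpMM", &stage);CHKERRQ(ierr);
  ierr = PetscLogStagePush(stage);CHKERRQ(ierr);
  for (i = 0; i < iters; i++) {
    ierr = MatMatMult(A, B, MAT_REUSE_MATRIX, PETSC_DEFAULT, &C);CHKERRQ(ierr);
  }
  ierr = PetscLogStagePop();CHKERRQ(ierr);

  ierr = MatDestroy(&C);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}
```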
