> On Feb 3, 2022, at 12:11 AM, Rohan Yadav <[email protected]> wrote:
>
> Hi All,
>
> I'm trying to use the MatMatMult function with a sparse matrix B and two
> dense matrices A and C. I'm computing A = B * C.
>
> My code is below:
>
> ```
> void spmm(Mat B, int warmup, int niter) {
>   Mat A, C;
>   PetscInt i, j = 32, k;
>   MatGetSize(B, &i, &k);
>   MatCreateDenseCUDA(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, i, j, NULL, &A);
>   MatCreateDenseCUDA(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, k, j, NULL, &C);
>
>   // Initialize entries in the output A and the dense input C.
>   MatZeroEntries(A);
>   setMatToConstant(C, 1.0);
>
>   // Finally, do the computation.
>   auto avgTime = benchmarkWithWarmup(warmup, niter, [&]() {
>     MatMatMult(B, C, MAT_REUSE_MATRIX, PETSC_DEFAULT, &A);
>   });
>   PetscPrintf(PETSC_COMM_WORLD, "Average time: %lf ms.\n", avgTime * 1000);
> }
> ```
> where benchmarkWithWarmup is a simple wrapper function that runs a lambda
> several times.
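>
> Roughly, it is equivalent to the following (a sketch; my actual helper may
> differ slightly):
>
> ```
> // Sketch: run the lambda `warmup` times untimed, then report the average
> // time in seconds over `niter` timed runs.
> template <typename F>
> double benchmarkWithWarmup(int warmup, int niter, F f) {
>   for (int i = 0; i < warmup; i++) { f(); }
>   PetscLogDouble start, end;
>   PetscTime(&start);
>   for (int i = 0; i < niter; i++) { f(); }
>   PetscTime(&end);
>   return (end - start) / niter;
> }
> ```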
>
> I'm running this function with arguments `-vec_type cuda -mat_type
> aijcusparse`,
These arguments are not appropriate here; they are only used by certain
examples, so you should not rely on them.
> and see that the performance is relatively slow. I'm wondering if I'm using
> the API incorrectly, or if the computation is executing as expected. `nvprof`
> shows that much of the time is spent in device-to-host memcpys:
Please send the code that builds the sparse B matrix and the
setMatToConstant() routine.
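For reference, I would guess setMatToConstant() is something roughly like the
following (a sketch only; please send the actual code):

```
// Hypothetical sketch of setMatToConstant(): fill every entry of a dense
// matrix with the value c. Your actual routine may differ, which is why I
// would like to see it.
void setMatToConstant(Mat mat, PetscScalar c) {
  PetscScalar *array;
  PetscInt     m, N, lda;
  MatGetLocalSize(mat, &m, NULL);  // locally owned rows
  MatGetSize(mat, NULL, &N);       // dense matrices store all N columns locally
  MatDenseGetLDA(mat, &lda);
  MatDenseGetArray(mat, &array);
  for (PetscInt j = 0; j < N; j++) {
    for (PetscInt i = 0; i < m; i++) array[i + j * lda] = c;
  }
  MatDenseRestoreArray(mat, &array);
}
```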
> ```
> Type Time(%) Time Calls Avg Min Max
> Name
> GPU activities: 87.32% 11.9978s 33 363.57ms 1.5040us 388.26ms
> [CUDA memcpy DtoH]
> 8.71% 1.19611s 30 39.870ms 37.421ms 39.976ms
> void cusparse::csrmm_kernel<cusparse::CsrMMPolicy<unsigned int=128, bool=0,
> bool=0, unsigned int=8, unsigned int=16, unsigned int=4, unsigned int=0>,
> int, double, double, double>(bool=0,
> cusparse::csrmm_kernel<cusparse::CsrMMPolicy<unsigned int=128, bool=0,
> bool=0, unsigned int=8, unsigned int=16, unsigned int=4, unsigned int=0>,
> int, double, double, double>,
> cusparse::csrmm_kernel<cusparse::CsrMMPolicy<unsigned int=128, bool=0,
> bool=0, unsigned int=8, unsigned int=16, unsigned int=4, unsigned int=0>,
> int, double, double, double>, bool=0 const *, bool=0 const , bool=0, bool=0,
> int, cusparseOperation_t,
> cusparse::csrmm_kernel<cusparse::CsrMMPolicy<unsigned int=128, bool=0,
> bool=0, unsigned int=8, unsigned int=16, unsigned int=4, unsigned int=0>,
> int, double, double, double> const *,
> cusparse::csrmm_kernel<cusparse::CsrMMPolicy<unsigned int=128, bool=0,
> bool=0, unsigned int=8, unsigned int=16, unsigned int=4, unsigned int=0>,
> int, double, double, double> const , unsigned int=8 const *, unsigned int=8
> const , cusparse::csrmm_kernel<cusparse::CsrMMPolicy<unsigned int=128,
> bool=0, bool=0, unsigned int=8, unsigned int=16, unsigned int=4, unsigned
> int=0>, int, double, double, double>, unsigned int=16*,
> cusparse::csrmm_kernel<cusparse::CsrMMPolicy<unsigned int=128, bool=0,
> bool=0, unsigned int=8, unsigned int=16, unsigned int=4, unsigned int=0>,
> int, double, double, double>)
> 3.87% 531.56ms 14 37.968ms 1.0240us 227.29ms
> [CUDA memcpy HtoD]
> 0.07% 9.7452ms 6 1.6242ms 1.0880us 3.2481ms
> [CUDA memset]
> 0.02% 2.8727ms 1 2.8727ms 2.8727ms 2.8727ms
> void
> thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrust::cuda_cub::__uninitialized_fill::functor<thrust::device_ptr<double>,
> double>, unsigned long>,
> thrust::cuda_cub::__uninitialized_fill::functor<thrust::device_ptr<double>,
> double>, unsigned long>(thrust::device_ptr<double>, double)
> 0.01% 1.4953ms 2 747.67us 56.188us 1.4392ms
> void
> thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::__parallel_for::ParallelForAgent<thrust::cuda_cub::__uninitialized_fill::functor<thrust::device_ptr<int>,
> int>, unsigned long>,
> thrust::cuda_cub::__uninitialized_fill::functor<thrust::device_ptr<int>,
> int>, unsigned long>(thrust::device_ptr<int>, int)
> ```
> The logview output is:
> ```-----------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
---------------------------------------------------------------------------------------------------------------------------------------------------------------
MatMatMultSym         60 1.0 9.2215e-01 2.6 0.00e+00 0.0 4.0e+00 7.3e+05 3.2e+01  1  0  6  0 40   1  0  6  0 51     0       0      0 0.00e+00    0 0.00e+00  0
MatMatMultNum         30 1.0 4.2967e+01 1.0 6.34e+11 1.1 6.0e+01 9.4e+07 0.0e+00 37100 86110  0  37100 86110  0 28598  920026      2 6.71e+03   30 8.73e+04 98
MatCUSPARSCopyTo       1 1.0 4.4761e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      1 3.80e+03    0 0.00e+00  0
MatDenseCopyTo         1 1.0 2.2742e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      1 2.91e+03    0 0.00e+00  0
MatDenseCopyFrom      31 1.0 1.2006e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 10  0  0  0  0  10  0  0  0  0     0       0      0 0.00e+00   31 9.02e+04  0
From the third line (MatCUSPARSCopyTo) we see that the sparse matrix is copied
to the GPU once; this is good.
From line 4 (MatDenseCopyTo) a dense matrix is copied to the GPU once; this is
good.
But from line 5 (MatDenseCopyFrom) we see that a dense matrix is copied from
the GPU to the CPU 31 times! Looking at line 2 (MatMatMultNum) we see 30 copies
from the GPU to the CPU.
The flop rate on the GPU is 920,026 Mflop/s, which is fine, but the flop rate
over the entire multiply time is a terrible 28,598 Mflop/s: the 1.229e+12 total
flop are divided by the full 43 seconds of MatMatMultNum, which includes all
the copies between the GPU and CPU and between the CPU and GPU, rather than the
roughly 1.3 seconds of actual cuSPARSE kernel time.
So let's see if we can figure out why all these copies from the GPU to the CPU
are taking place.
But first, please verify that if you run with one MPI rank, the "on GPU" and
the overall flop rates for the MatMatMult() are almost the same and that there
is no copy from the GPU for each multiply.
I think the parallel multiply is done with MatMatMultNumeric_MPIAIJ_MPIDense().
This code has two problems:
1) It uses MatMPIDenseScatter() to move the needed rows of the C matrix to the
other ranks. That function calls MatDenseGetArrayRead(), which would normally
trigger a copy of C up to the CPU each time. But since C is not changing in
your test run, I guess it only triggers one copy.
2) It uses
MatMatMultNumericAdd_SeqAIJ_SeqDense(aij->B,workB,cdense->A,PETSC_TRUE);CHKERRQ(ierr);
to do the off-diagonal part of the product, but this triggers, for each
multiply, a copy of the result matrix from the GPU to the CPU (hugely
expensive).
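Roughly, the issue comes down to which accessor is used on the dense matrix; a
schematic sketch (illustrative only, the helper name and flow are not the
actual PETSc internals):

```
/* Schematic: the host accessor forces data that is valid on the GPU to be
   copied down to the CPU, while the device accessor leaves it on the GPU.  */
static PetscErrorCode useDense(Mat C)
{
  const PetscScalar *a;

  /* Host pointer: if the up-to-date copy of C lives on the GPU, PETSc first
     performs a GpuToCpu copy before returning the array.                   */
  MatDenseGetArrayRead(C, &a);
  MatDenseRestoreArrayRead(C, &a);

  /* Device pointer: the data stays on the GPU; no GpuToCpu copy is needed. */
  MatDenseCUDAGetArrayRead(C, &a);
  MatDenseCUDARestoreArrayRead(C, &a);
  return 0;
}
```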
For performance there needs to be a new routine,
MatMatMultNumeric_MPIAIJCUSPARSE_MPICUDADense(), that is smarter about the MPI
communication, moving only exactly what each rank needs, and that does the
off-diagonal part of the product on the GPU so that it never has to copy the
result up to the CPU.
Barry
> ---------------------------------------------- PETSc Performance Summary:
> ----------------------------------------------
>
> /g/g15/yadav2/taco/petsc/bin/benchmark on a named lassen457 with 2
> processors, by yadav2 Wed Feb 2 17:23:19 2022
> Using Petsc Release Version 3.16.3, unknown
>
> Max Max/Min Avg Total
> Time (sec): 1.163e+02 1.000 1.163e+02
> Objects: 4.800e+01 1.000 4.800e+01
> Flop: 6.338e+11 1.065 6.144e+11 1.229e+12
> Flop/sec: 5.451e+09 1.065 5.284e+09 1.057e+10
> MPI Messages: 3.500e+01 1.000 3.500e+01 7.000e+01
> MPI Message Lengths: 2.544e+09 1.000 7.267e+07 5.087e+09
> MPI Reductions: 8.100e+01 1.000
>
> Flop counting convention: 1 flop = 1 real number operation of type
> (multiply/divide/add/subtract)
> e.g., VecAXPY() for real vectors of length N -->
> 2N flop
> and VecAXPY() for complex vectors of length N -->
> 8N flop
>
> Summary of Stages: ----- Time ------ ----- Flop ------ --- Messages ---
> -- Message Lengths -- -- Reductions --
> Avg %Total Avg %Total Count %Total
> Avg %Total Count %Total
> 0: Main Stage: 1.1628e+02 100.0% 1.2288e+12 100.0% 7.000e+01 100.0%
> 7.267e+07 100.0% 6.300e+01 77.8%
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on interpreting
> output.
> Phase summary info:
> Count: number of times phase was executed
> Time and Flop: Max - maximum over all processors
> Ratio - ratio of maximum to minimum over all processors
> Mess: number of messages sent
> AvgLen: average message length (bytes)
> Reduct: number of global reductions
> Global: entire computation
> Stage: stages of a computation. Set stages with PetscLogStagePush() and
> PetscLogStagePop().
> %T - percent time in this phase %F - percent flop in this phase
> %M - percent messages in this phase %L - percent message lengths in
> this phase
> %R - percent reductions in this phase
> Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over
> all processors)
> GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU
> time over all processors)
> CpuToGpu Count: total number of CPU to GPU copies per processor
> CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per
> processor)
> GpuToCpu Count: total number of GPU to CPU copies per processor
> GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per
> processor)
> GPU %F: percent flops on GPU in this event
> ------------------------------------------------------------------------------------------------------------------------
> Event Count Time (sec) Flop
> --- Global --- --- Stage ---- Total GPU - CpuToGpu - - GpuToCpu -
> GPU
> Max Ratio Max Ratio Max Ratio Mess AvgLen
> Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s Mflop/s Count Size Count
> Size %F
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> BuildTwoSided 2 1.0 4.4400e-01567.5 0.00e+00 0.0 2.0e+00 4.0e+00
> 2.0e+00 0 0 3 0 2 0 0 3 0 3 0 0 0 0.00e+00 0
> 0.00e+00 0
> BuildTwoSidedF 1 1.0 4.4395e-0115659.1 0.00e+00 0.0 0.0e+00 0.0e+00
> 1.0e+00 0 0 0 0 1 0 0 0 0 2 0 0 0 0.00e+00 0
> 0.00e+00 0
> MatAssemblyBegin 32 1.0 4.4400e-017378.9 0.00e+00 0.0 0.0e+00 0.0e+00
> 1.0e+00 0 0 0 0 1 0 0 0 0 2 0 0 0 0.00e+00 0
> 0.00e+00 0
> MatAssemblyEnd 32 1.0 1.8511e+00 2.2 0.00e+00 0.0 0.0e+00 0.0e+00
> 6.0e+00 1 0 0 0 7 1 0 0 0 10 0 0 0 0.00e+00 0
> 0.00e+00 0
> MatZeroEntries 1 1.0 3.3306e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0
> 0.00e+00 0
> MatLoad 1 1.0 1.7220e+01 1.0 0.00e+00 0.0 6.0e+00 -8.8e+07
> 2.1e+01 15 0 9-10 26 15 0 9-10 33 0 0 0 0.00e+00 0
> 0.00e+00 0
> MatMatMultSym 60 1.0 9.2215e-01 2.6 0.00e+00 0.0 4.0e+00 7.3e+05
> 3.2e+01 1 0 6 0 40 1 0 6 0 51 0 0 0 0.00e+00 0
> 0.00e+00 0
> MatMatMultNum 30 1.0 4.2967e+01 1.0 6.34e+11 1.1 6.0e+01 9.4e+07
> 0.0e+00 37100 86110 0 37100 86110 0 28598 920026 2 6.71e+03 30
> 8.73e+04 98
> MatCUSPARSCopyTo 1 1.0 4.4761e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 1 3.80e+03 0
> 0.00e+00 0
> MatDenseCopyTo 1 1.0 2.2742e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 1 2.91e+03 0
> 0.00e+00 0
> MatDenseCopyFrom 31 1.0 1.2006e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 10 0 0 0 0 10 0 0 0 0 0 0 0 0.00e+00 31
> 9.02e+04 0
> VecSet 3 1.0 4.1917e-04 1.1 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0
> 0.00e+00 0
> SFSetGraph 1 1.0 1.9180e-04 1.1 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0
> 0.00e+00 0
> SFSetUp 1 1.0 1.3672e-02 1.1 0.00e+00 0.0 4.0e+00 7.3e+05
> 1.0e+00 0 0 6 0 1 0 0 6 0 2 0 0 0 0.00e+00 0
> 0.00e+00 0
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Memory usage is given in bytes:
>
> Object Type Creations Destructions Memory Descendants' Mem.
> Reports information only for process 0.
>
> --- Event Stage 0: Main Stage
>
> Matrix 37 30 2867511840 0.
> Viewer 2 0 0 0.
> Vector 4 1 1792 0.
> Index Set 2 2 1495248 0.
> Star Forest Graph 3 0 0 0.
> ========================================================================================================================
> Average time to get PetscTime(): 3.83e-08
> Average time for MPI_Barrier(): 7.874e-07
> Average time for zero size MPI_Send(): 3.4035e-06
> #PETSc Option Table entries:
> -bench spmm
> -enable_gpu
> -log_view
> -mat_type aijcusparse
> -matload_block_size 1
> -matrix /p/gpfs1/yadav2/tensors/petsc/arabic-2005.petsc
> -n 20
> -vec_type cuda
> -warmup 10
> ```
>
> Thanks,
>
> Rohan Yadav
>