Mark,
Good eye. Something is definitely very different between this run and the
previous one (different options, a code change?). In the previously sent runs
it was about 98% on the GPU.
Barry
> On Feb 3, 2022, at 12:29 PM, Rohan Yadav <[email protected]> wrote:
>
> > Please send the code that builds the sparse B matrix and the
> > setMatToConstant() routine.
>
> Setting to a constant:
> ```
> void setMatToConstant(Mat mat, PetscScalar c) {
>   PetscInt rStart, rEnd, m, n;
>   MatGetSize(mat, &m, &n);
>   MatGetOwnershipRange(mat, &rStart, &rEnd);
>   // Insert the constant into every column of each locally owned row.
>   for (PetscInt i = rStart; i < rEnd; i++) {
>     for (PetscInt j = 0; j < n; j++) {
>       MatSetValue(mat, i, j, c, INSERT_VALUES);
>     }
>   }
>   MatAssemblyBegin(mat, MAT_FINAL_ASSEMBLY);
>   MatAssemblyEnd(mat, MAT_FINAL_ASSEMBLY);
> }
> ```
>
> Loading sparse matrix from disk:
> ```
> int loadMatrixFromFile(Mat* A, char* filename) {
>   PetscErrorCode ierr;
>   ierr = MatCreate(PETSC_COMM_WORLD, A); CHKERRQ(ierr);
>   ierr = MatSetFromOptions(*A); CHKERRQ(ierr);
>   PetscViewer viewer;
>   ierr = PetscViewerCreate(PETSC_COMM_WORLD, &viewer); CHKERRQ(ierr);
>   ierr = PetscViewerSetType(viewer, PETSCVIEWERBINARY); CHKERRQ(ierr);
>   ierr = PetscViewerFileSetMode(viewer, FILE_MODE_READ); CHKERRQ(ierr);
>   ierr = PetscViewerFileSetName(viewer, filename); CHKERRQ(ierr);
>   ierr = MatLoad(*A, viewer); CHKERRQ(ierr);
>   // Free the viewer once the matrix has been loaded.
>   ierr = PetscViewerDestroy(&viewer); CHKERRQ(ierr);
>   return 0;
> }
> ```
> These are only called once, though, and should not affect the computation in
> the loop.
> > But first please verify that if you run with one MPI rank the "on GPU" and
> > the overall flop rates for the MatMatMult() are almost the same and there
> > is no copy from the GPU for each multiply?
>
> Yes, with 1 MPI rank / GPU there are no extra copies. As soon as I move to 2
> ranks I see this behavior.
>
> Here are updated logs with a new stage for 2 ranks. I've staged the logs into
> "MyComputation".
>
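> For reference, a minimal sketch of how the stage is set up around the timed
> loop (the matrix names, loop bounds, and use of MatMatMult here are
> illustrative placeholders for the benchmark code, with `warmup` and `n`
> corresponding to the -warmup/-n options):
>
> ```
> PetscLogStage stage;
> PetscLogStageRegister("MyComputation", &stage);
>
> // Warmup multiplies run outside the logged stage; the first call creates the result.
> Mat result = NULL;
> for (int i = 0; i < warmup; i++) {
>   MatMatMult(B, C, i == 0 ? MAT_INITIAL_MATRIX : MAT_REUSE_MATRIX, PETSC_DEFAULT, &result);
> }
>
> // Only the timed multiplies are attributed to the "MyComputation" stage by -log_view.
> PetscLogStagePush(stage);
> for (int i = 0; i < n; i++) {
>   MatMatMult(B, C, MAT_REUSE_MATRIX, PETSC_DEFAULT, &result);
> }
> PetscLogStagePop();
> ```
>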
> ```
> ---------------------------------------------- PETSc Performance Summary:
> ----------------------------------------------
>
> /g/g15/yadav2/taco/petsc/bin/benchmark on a named lassen572 with 2
> processors, by yadav2 Thu Feb 3 09:27:30 2022
> Using Petsc Release Version 3.16.3, unknown
>
> Max Max/Min Avg Total
> Time (sec): 2.091e+02 1.001 2.090e+02
> Objects: 4.800e+01 1.000 4.800e+01
> Flop: 4.344e+11 1.019 4.303e+11 8.606e+11
> Flop/sec: 2.077e+09 1.018 2.059e+09 4.118e+09
> MPI Messages: 3.500e+01 1.000 3.500e+01 7.000e+01
> MPI Message Lengths: 6.316e+10 1.000 1.805e+09 1.263e+11
> MPI Reductions: 8.100e+01 1.000
>
> Flop counting convention: 1 flop = 1 real number operation of type
> (multiply/divide/add/subtract)
> e.g., VecAXPY() for real vectors of length N -->
> 2N flop
> and VecAXPY() for complex vectors of length N -->
> 8N flop
>
> Summary of Stages: ----- Time ------ ----- Flop ------ --- Messages ---
> -- Message Lengths -- -- Reductions --
> Avg %Total Avg %Total Count %Total
> Avg %Total Count %Total
> 0: Main Stage: 1.0555e+02 50.5% 2.8686e+11 33.3% 3.000e+01 42.9%
> 1.466e+09 34.8% 4.300e+01 53.1%
> 1: MyComputation: 1.0345e+02 49.5% 5.7373e+11 66.7% 4.000e+01 57.1%
> 2.058e+09 65.2% 2.000e+01 24.7%
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on interpreting
> output.
> Phase summary info:
> Count: number of times phase was executed
> Time and Flop: Max - maximum over all processors
> Ratio - ratio of maximum to minimum over all processors
> Mess: number of messages sent
> AvgLen: average message length (bytes)
> Reduct: number of global reductions
> Global: entire computation
> Stage: stages of a computation. Set stages with PetscLogStagePush() and
> PetscLogStagePop().
> %T - percent time in this phase %F - percent flop in this phase
> %M - percent messages in this phase %L - percent message lengths in
> this phase
> %R - percent reductions in this phase
> Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over
> all processors)
> GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU
> time over all processors)
> CpuToGpu Count: total number of CPU to GPU copies per processor
> CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per
> processor)
> GpuToCpu Count: total number of GPU to CPU copies per processor
> GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per
> processor)
> GPU %F: percent flops on GPU in this event
> ------------------------------------------------------------------------------------------------------------------------
> Event Count Time (sec) Flop
> --- Global --- --- Stage ---- Total GPU - CpuToGpu - - GpuToCpu -
> GPU
> Max Ratio Max Ratio Max Ratio Mess AvgLen
> Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s Mflop/s Count Size Count
> Size %F
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> BuildTwoSided 2 1.0 4.0085e-0136.3 0.00e+00 0.0 2.0e+00 4.0e+00
> 2.0e+00 0 0 3 0 2 0 0 7 0 5 0 0 0 0.00e+00 0
> 0.00e+00 0
> BuildTwoSidedF 1 1.0 4.0080e-0113602.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 1.0e+00 0 0 0 0 1 0 0 0 0 2 0 0 0 0.00e+00 0
> 0.00e+00 0
> MatAssemblyBegin 12 1.0 4.0084e-017217.1 0.00e+00 0.0 0.0e+00 0.0e+00
> 1.0e+00 0 0 0 0 1 0 0 0 0 2 0 0 0 0.00e+00 0
> 0.00e+00 0
> MatAssemblyEnd 12 1.0 3.4970e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 6.0e+00 2 0 0 0 7 3 0 0 0 14 0 0 0 0.00e+00 0
> 0.00e+00 0
> MatZeroEntries 1 1.0 2.4093e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0
> 0.00e+00 0
> MatLoad 1 1.0 1.3756e+01 1.0 0.00e+00 0.0 6.0e+00 4.6e+08
> 2.1e+01 7 0 9 2 26 13 0 20 6 49 0 0 0 0.00e+00 0
> 0.00e+00 0
> MatMatMultSym 20 1.0 4.7919e+00 2.4 0.00e+00 0.0 4.0e+00 1.6e+07
> 1.2e+01 2 0 6 0 15 3 0 13 0 28 0 0 0 0.00e+00 0
> 0.00e+00 0
> MatMatMultNum 10 1.0 4.9853e+01 1.1 1.45e+11 1.0 2.0e+01 2.1e+09
> 0.0e+00 23 33 29 33 0 46100 67 94 0 5754 182686 2 2.23e+03 10
> 2.08e+04 5
> MatCUSPARSCopyTo 1 1.0 2.2646e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 1 1.55e+02 0
> 0.00e+00 0
> MatDenseCopyTo 1 1.0 1.6636e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 1 2.08e+03 0
> 0.00e+00 0
> MatDenseCopyFrom 11 1.0 3.0463e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 1 0 0 0 0 3 0 0 0 0 0 0 0 0.00e+00 11
> 2.29e+04 0
> VecSet 3 1.0 5.0035e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0
> 0.00e+00 0
> SFSetGraph 1 1.0 4.4294e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0
> 0.00e+00 0
> SFSetUp 1 1.0 1.3982e-01 1.0 0.00e+00 0.0 4.0e+00 1.6e+07
> 1.0e+00 0 0 6 0 1 0 0 13 0 2 0 0 0 0.00e+00 0
> 0.00e+00 0
>
> --- Event Stage 1: MyComputation
>
> MatAssemblyBegin 20 1.0 1.6894e-05 2.7 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0
> 0.00e+00 0
> MatAssemblyEnd 20 1.0 1.5575e-05 1.5 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0
> 0.00e+00 0
> MatMatMultSym 40 1.0 1.0096e+01 2.6 0.00e+00 0.0 0.0e+00 0.0e+00
> 2.0e+01 3 0 0 0 25 7 0 0 0100 0 0 0 0.00e+00 0
> 0.00e+00 0
> MatMatMultNum 20 1.0 9.9320e+01 1.1 2.90e+11 1.0 4.0e+01 2.1e+09
> 0.0e+00 46 67 57 65 0 93100100100 0 5777 182577 0 0.00e+00 20
> 4.16e+04 5
> MatDenseCopyFrom 20 1.0 5.5380e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 3 0 0 0 0 5 0 0 0 0 0 0 0 0.00e+00 20
> 4.16e+04 0
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Memory usage is given in bytes:
>
> Object Type Creations Destructions Memory Descendants' Mem.
> Reports information only for process 0.
>
> --- Event Stage 0: Main Stage
>
> Matrix 17 10 20381695840 0.
> Viewer 2 0 0 0.
> Vector 4 1 1792 0.
> Index Set 2 2 31848152 0.
> Star Forest Graph 3 0 0 0.
>
> --- Event Stage 1: MyComputation
>
> Matrix 20 20 40763391680 0.
> ========================================================================================================================
> Average time to get PetscTime(): 3.96e-08
> Average time for MPI_Barrier(): 8.184e-07
> Average time for zero size MPI_Send(): 2.8165e-06
> #PETSc Option Table entries:
> -bench spmm
> -enable_gpu
> -log_view
> -mat_type aijcusparse
> -matload_block_size 1
> -matrix /p/gpfs1/yadav2/tensors/petsc/nlpkkt200.petsc
> -n 20
> -vec_type cuda
> -warmup 10
> #End of PETSc Option Table entries
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
> sizeof(PetscScalar) 8 sizeof(PetscInt) 4
> Configure options: --download-c2html=0 --download-hwloc=0 --download-sowing=0
> --prefix=./petsc-install/ --with-64-bit-indices=0
> --with-blaslapack-lib="/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib/liblapack.so
> /usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib/libblas.so"
> --with-cc=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc
> --with-clanguage=C --with-cxx-dialect=C++17
> --with-cxx=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpig++
> --with-cuda=1 --with-debugging=0
> --with-fc=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran
> --with-fftw=0
> --with-hdf5-dir=/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4
> --with-hdf5=1 --with-mumps=0 --with-precision=double --with-scalapack=0
> --with-scalar-type=real --with-shared-libraries=1 --with-ssl=0
> --with-suitesparse=0 --with-trilinos=0 --with-valgrind=0 --with-x=0
> --with-zlib-include=/usr/include --with-zlib-lib=/usr/lib64/libz.so
> --with-zlib=1 CFLAGS="-g -DNoChange" COPTFLAGS="-O3" CXXFLAGS="-O3"
> CXXOPTFLAGS="-O3" FFLAGS=-g CUDAFLAGS=-std=c++17 FOPTFLAGS=
> PETSC_ARCH=arch-linux-c-opt
> -----------------------------------------
> Libraries compiled on 2022-01-21 06:41:50 on lassen111
> Machine characteristics:
> Linux-4.14.0-115.21.2.1chaos.ch6a.ppc64le-ppc64le-with-redhat-7.6-Maipo
> Using PETSc directory: /g/g15/yadav2/taco/petsc/petsc/petsc-install
> Using PETSc arch:
> -----------------------------------------
>
> Using C compiler:
> /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc
> -g -DNoChange -fPIC "-O3"
> Using Fortran compiler:
> /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran
> -g -fPIC
> -----------------------------------------
>
> Using include paths: -I/g/g15/yadav2/taco/petsc/petsc/petsc-install/include
> -I/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/include
> -I/usr/include -I/usr/tce/packages/cuda/cuda-11.1.0/include
> -----------------------------------------
>
> Using C linker:
> /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc
> Using Fortran linker:
> /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran
> Using libraries: -Wl,-rpath,/g/g15/yadav2/taco/petsc/petsc/petsc-install/lib
> -L/g/g15/yadav2/taco/petsc/petsc/petsc-install/lib -lpetsc
> -Wl,-rpath,/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib
> -L/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib
> -Wl,-rpath,/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/lib
> -L/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/lib
> -Wl,-rpath,/usr/tce/packages/cuda/cuda-11.1.0/lib64
> -L/usr/tce/packages/cuda/cuda-11.1.0/lib64
> -Wl,-rpath,/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib
> -L/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib
> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc/ppc64le-redhat-linux/8
> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc/ppc64le-redhat-linux/8
> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc
> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc
> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib64
> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib64
> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib
> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib -llapack -lblas -lhdf5_hl -lhdf5
> -lm /usr/lib64/libz.so -lcuda -lcudart -lcufft -lcublas -lcusparse -lcusolver
> -lcurand -lstdc++ -ldl -lmpiprofilesupport -lmpi_ibm_usempi -lmpi_ibm_mpifh
> -lmpi_ibm -lgfortran -lm -lgfortran -lm -lgcc_s -lquadmath -lpthread
> -lquadmath -lstdc++ -ldl
> -----------------------------------------
> ```
>
> On Wed, Feb 2, 2022 at 11:59 PM Stefano Zampini <[email protected]> wrote:
>
>
> 1) It uses MatMPIDenseScatter() to move the rows of the C matrix that the
> other ranks need. That function calls MatDenseGetArrayRead(), which would
> normally trigger a copy of C up to the CPU each time. But since C is not
> changing in your test run, I guess it only triggers one copy.
>
> 2) It uses
> MatMatMultNumericAdd_SeqAIJ_SeqDense(aij->B,workB,cdense->A,PETSC_TRUE);CHKERRQ(ierr);
> to do the off-diagonal part of the product, but this triggers, for each
> multiply, a copy of the result matrix from the CPU to the GPU (hugely
> expensive).
>
> For performance there needs to be a new routine,
> MatMatMultNumeric_MPIAIJCUSPARSE_MPICUDADense(), that is smarter about the
> needed MPI communication, so it only moves exactly what it needs to the other
> ranks and does the off-diagonal part of the product on the GPU, so it does
> not need to copy the result up to the CPU.
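>
> To make the shape of this concrete, here is a rough, non-PETSc sketch of how
> the per-rank product decomposes (plain CSR/dense arrays with made-up names;
> PETSc's actual data structures and code paths differ):
>
> ```
> /* Rough sketch, not PETSc source: each rank owns a "diagonal" sparse block
>  * (columns it owns) and an "off-diagonal" block (columns owned by others).
>  * The diagonal part multiplies the locally owned rows of the dense C; the
>  * off-diagonal part multiplies the rows of C gathered from other ranks, and
>  * that is the piece currently routed through the CPU. */
> typedef struct {
>   int           nrows;
>   const int    *rowptr;  /* CSR row pointers   */
>   const int    *colidx;  /* CSR column indices */
>   const double *vals;    /* CSR values         */
> } CSRBlock;
>
> /* result += block * C, with C and result stored row-major with ldc columns. */
> static void spmm_add(const CSRBlock *blk, const double *C, int ldc, double *result) {
>   for (int i = 0; i < blk->nrows; i++)
>     for (int k = blk->rowptr[i]; k < blk->rowptr[i + 1]; k++)
>       for (int j = 0; j < ldc; j++)
>         result[i * ldc + j] += blk->vals[k] * C[blk->colidx[k] * ldc + j];
> }
>
> static void rank_local_spmm(const CSRBlock *A_diag, const CSRBlock *A_off,
>                             const double *C_local, const double *C_gathered,
>                             int ldc, double *result) {
>   spmm_add(A_diag, C_local, ldc, result);   /* can stay on the GPU              */
>   spmm_add(A_off, C_gathered, ldc, result); /* the part that forces the copies  */
> }
> ```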
>
>
> MPIAIJCUSPARSE uses MatProductSetFromOptions_MPIAIJBACKEND
>
> Rohan
> I would suggest adding a PetscLogStage around your performance loop (do a
> warmup outside of it) and sending the relevant portion of the log.
>
> Barry
>
>
>
>
>
>
>> ```
>> ---------------------------------------------- PETSc Performance Summary:
>> ----------------------------------------------
>>
>> /g/g15/yadav2/taco/petsc/bin/benchmark on a named lassen457 with 2
>> processors, by yadav2 Wed Feb 2 17:23:19 2022
>> Using Petsc Release Version 3.16.3, unknown
>>
>> Max Max/Min Avg Total
>> Time (sec): 1.163e+02 1.000 1.163e+02
>> Objects: 4.800e+01 1.000 4.800e+01
>> Flop: 6.338e+11 1.065 6.144e+11 1.229e+12
>> Flop/sec: 5.451e+09 1.065 5.284e+09 1.057e+10
>> MPI Messages: 3.500e+01 1.000 3.500e+01 7.000e+01
>> MPI Message Lengths: 2.544e+09 1.000 7.267e+07 5.087e+09
>> MPI Reductions: 8.100e+01 1.000
>>
>> Flop counting convention: 1 flop = 1 real number operation of type
>> (multiply/divide/add/subtract)
>> e.g., VecAXPY() for real vectors of length N -->
>> 2N flop
>> and VecAXPY() for complex vectors of length N
>> --> 8N flop
>>
>> Summary of Stages: ----- Time ------ ----- Flop ------ --- Messages ---
>> -- Message Lengths -- -- Reductions --
>> Avg %Total Avg %Total Count %Total
>> Avg %Total Count %Total
>> 0: Main Stage: 1.1628e+02 100.0% 1.2288e+12 100.0% 7.000e+01 100.0%
>> 7.267e+07 100.0% 6.300e+01 77.8%
>>
>> ------------------------------------------------------------------------------------------------------------------------
>> See the 'Profiling' chapter of the users' manual for details on interpreting
>> output.
>> Phase summary info:
>> Count: number of times phase was executed
>> Time and Flop: Max - maximum over all processors
>> Ratio - ratio of maximum to minimum over all processors
>> Mess: number of messages sent
>> AvgLen: average message length (bytes)
>> Reduct: number of global reductions
>> Global: entire computation
>> Stage: stages of a computation. Set stages with PetscLogStagePush() and
>> PetscLogStagePop().
>> %T - percent time in this phase %F - percent flop in this phase
>> %M - percent messages in this phase %L - percent message lengths
>> in this phase
>> %R - percent reductions in this phase
>> Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over
>> all processors)
>> GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU
>> time over all processors)
>> CpuToGpu Count: total number of CPU to GPU copies per processor
>> CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per
>> processor)
>> GpuToCpu Count: total number of GPU to CPU copies per processor
>> GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per
>> processor)
>> GPU %F: percent flops on GPU in this event
>> ------------------------------------------------------------------------------------------------------------------------
>> Event Count Time (sec) Flop
>> --- Global --- --- Stage ---- Total GPU - CpuToGpu - - GpuToCpu
>> - GPU
>> Max Ratio Max Ratio Max Ratio Mess AvgLen
>> Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s Mflop/s Count Size Count
>> Size %F
>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>
>> --- Event Stage 0: Main Stage
>>
>> BuildTwoSided 2 1.0 4.4400e-01567.5 0.00e+00 0.0 2.0e+00 4.0e+00
>> 2.0e+00 0 0 3 0 2 0 0 3 0 3 0 0 0 0.00e+00 0
>> 0.00e+00 0
>> BuildTwoSidedF 1 1.0 4.4395e-0115659.1 0.00e+00 0.0 0.0e+00 0.0e+00
>> 1.0e+00 0 0 0 0 1 0 0 0 0 2 0 0 0 0.00e+00 0
>> 0.00e+00 0
>> MatAssemblyBegin 32 1.0 4.4400e-017378.9 0.00e+00 0.0 0.0e+00 0.0e+00
>> 1.0e+00 0 0 0 0 1 0 0 0 0 2 0 0 0 0.00e+00 0
>> 0.00e+00 0
>> MatAssemblyEnd 32 1.0 1.8511e+00 2.2 0.00e+00 0.0 0.0e+00 0.0e+00
>> 6.0e+00 1 0 0 0 7 1 0 0 0 10 0 0 0 0.00e+00 0
>> 0.00e+00 0
>> MatZeroEntries 1 1.0 3.3306e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0
>> 0.00e+00 0
>> MatLoad 1 1.0 1.7220e+01 1.0 0.00e+00 0.0 6.0e+00 -8.8e+07
>> 2.1e+01 15 0 9-10 26 15 0 9-10 33 0 0 0 0.00e+00 0
>> 0.00e+00 0
>> MatMatMultSym 60 1.0 9.2215e-01 2.6 0.00e+00 0.0 4.0e+00 7.3e+05
>> 3.2e+01 1 0 6 0 40 1 0 6 0 51 0 0 0 0.00e+00 0
>> 0.00e+00 0
>> MatMatMultNum 30 1.0 4.2967e+01 1.0 6.34e+11 1.1 6.0e+01 9.4e+07
>> 0.0e+00 37100 86110 0 37100 86110 0 28598 920026 2 6.71e+03 30
>> 8.73e+04 98
>> MatCUSPARSCopyTo 1 1.0 4.4761e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 1 3.80e+03 0
>> 0.00e+00 0
>> MatDenseCopyTo 1 1.0 2.2742e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 1 2.91e+03 0
>> 0.00e+00 0
>> MatDenseCopyFrom 31 1.0 1.2006e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00 10 0 0 0 0 10 0 0 0 0 0 0 0 0.00e+00 31
>> 9.02e+04 0
>> VecSet 3 1.0 4.1917e-04 1.1 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0
>> 0.00e+00 0
>> SFSetGraph 1 1.0 1.9180e-04 1.1 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0
>> 0.00e+00 0
>> SFSetUp 1 1.0 1.3672e-02 1.1 0.00e+00 0.0 4.0e+00 7.3e+05
>> 1.0e+00 0 0 6 0 1 0 0 6 0 2 0 0 0 0.00e+00 0
>> 0.00e+00 0
>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>
>> Memory usage is given in bytes:
>>
>> Object Type Creations Destructions Memory Descendants' Mem.
>> Reports information only for process 0.
>>
>> --- Event Stage 0: Main Stage
>>
>> Matrix 37 30 2867511840 0.
>> Viewer 2 0 0 0.
>> Vector 4 1 1792 0.
>> Index Set 2 2 1495248 0.
>> Star Forest Graph 3 0 0 0.
>> ========================================================================================================================
>> Average time to get PetscTime(): 3.83e-08
>> Average time for MPI_Barrier(): 7.874e-07
>> Average time for zero size MPI_Send(): 3.4035e-06
>> #PETSc Option Table entries:
>> -bench spmm
>> -enable_gpu
>> -log_view
>> -mat_type aijcusparse
>> -matload_block_size 1
>> -matrix /p/gpfs1/yadav2/tensors/petsc/arabic-2005.petsc
>> -n 20
>> -vec_type cuda
>> -warmup 10
>> ```
>>
>> Thanks,
>>
>> Rohan Yadav
>>
>
>
>
> --
> Stefano