Alright, thanks for the help everyone.

Rohan

On Thu, Feb 3, 2022 at 2:01 PM Barry Smith <[email protected]> wrote:

>
>
> On Feb 3, 2022, at 4:28 PM, Rohan Yadav <[email protected]> wrote:
>
> To be concrete, the first matrix was
> https://sparse.tamu.edu/LAW/arabic-2005 and the second was
> https://sparse.tamu.edu/Schenk/nlpkkt200 (which looks like it does come
> from the PDE domain?).
>
>
>   You are correct; but the matrix bandwidth is so huge (see how far off the
> diagonal that second band of nonzeros sits) that for two ranks each MPI rank
> ends up needing the entire right-hand-side matrix to do the computation. Plus
> the zero in the lower diagonal block means that on the second rank there is
> essentially no work to be done on the GPU at all.
>
>
>
> Regardless of the nonzero structure, there is still a significant hit when
> moving from 1 GPU to multiple GPUs that causes a large number of
> device-to-host copies to be performed. If this is a result of the PETSc
> implementation that's fine -- but if there's something I can do to work
> around it, that would be great.
>
>
>   I don't understand enough about the code that Stefano pointed to to know
> how easy the performance problem would be to fix in PETSc for the sparse
> times dense matrix product. It would still be a problem for nlpkkt200
> <https://sparse.tamu.edu/Schenk/nlpkkt200> as partitioned on two ranks no
> matter what, but in principle the problem can be fixed by improving the
> PETSc code.
>
>   I recommend finding a different sparse matrix test case which has a
> nonzero structure and partitioning for which one can expect good
> performance.
>
>
>
>
> Rohan
>
> On Thu, Feb 3, 2022 at 1:25 PM Barry Smith <[email protected]> wrote:
>
>>
>>   I suspect the new matrix has a very different parallel nonzero
>> structure that results in MOST of the calculations taking place on the CPU
>> (since the "off-diagonal" part of the matrix dominates the nonzero
>> pattern). PETSc is not designed for this type of nonzero structure and will
>> give bad performance (CPU or GPU); it is not a "PDE-ish" type of nonzero
>> structure.
>>
>>
>>
>> On Feb 3, 2022, at 2:59 PM, Rohan Yadav <[email protected]> wrote:
>>
>> I'm sorry, I did a little switch here. The original log view I sent for 2
>> runs was on a different input matrix. Based on Barry's request I switched
>> to a different matrix as the original one did not fit on 1 GPU.
>>
>> >  In the previously sent runs it was about 98% on GPU.
>>
>> Regarding the 98% on GPU, my first email had a similar ratio in the log:
>> ```
>>
>> MatMatMultNum         30 1.0 4.2967e+01 1.0 6.34e+11 1.1 6.0e+01 9.4e+07 
>> 0.0e+00 37100 86110  0  37100 86110  0 28598   920026      2 6.71e+03   30 
>> 8.73e+04 98
>>
>> ```
>>
>> The follow-up log might also be slightly different because I pushed a new
>> log stage as requested by Stefano.
>>
>>
>> Rohan
>>
>>
>> On Thu, Feb 3, 2022 at 11:50 AM Barry Smith <[email protected]> wrote:
>>
>>>
>>>   Mark,
>>>
>>>     Good eye. Something is definitely very different between this run
>>> and the previous (options, code change?). In the previously sent runs it
>>> was about 98% on GPU.
>>>
>>>   Barry
>>>
>>>
>>> On Feb 3, 2022, at 12:29 PM, Rohan Yadav <[email protected]> wrote:
>>>
>>> >    Please send the code that builds the sparse B matrix and the 
>>> > setMatToConstant()
>>> routine.
>>>
>>> Setting to a constant:
>>> ```
>>> void setMatToConstant(Mat mat, PetscScalar c) {
>>>   PetscInt rStart, rEnd, m, n;
>>>   MatGetSize(mat, &m, &n);
>>>   MatGetOwnershipRange(mat, &rStart, &rEnd);
>>>   // Insert c into every column of each locally owned row.
>>>   for (PetscInt i = rStart; i < rEnd; i++) {
>>>     for (PetscInt j = 0; j < n; j++) {
>>>       MatSetValue(mat, i, j, c, INSERT_VALUES);
>>>     }
>>>   }
>>>   MatAssemblyBegin(mat, MAT_FINAL_ASSEMBLY);
>>>   MatAssemblyEnd(mat, MAT_FINAL_ASSEMBLY);
>>> }
>>> ```
>>>
>>> Loading sparse matrix from disk:
>>>
>>> ```
>>>
>>> int loadMatrixFromFile(Mat* A, char* filename) {
>>>   // Create the matrix and pick up the type from -mat_type.
>>>   PetscErrorCode ierr = MatCreate(PETSC_COMM_WORLD, A); CHKERRQ(ierr);
>>>   ierr = MatSetFromOptions(*A); CHKERRQ(ierr);
>>>   // Read the matrix from a PETSc binary file.
>>>   PetscViewer viewer;
>>>   ierr = PetscViewerCreate(PETSC_COMM_WORLD, &viewer); CHKERRQ(ierr);
>>>   ierr = PetscViewerSetType(viewer, PETSCVIEWERBINARY); CHKERRQ(ierr);
>>>   ierr = PetscViewerFileSetMode(viewer, FILE_MODE_READ); CHKERRQ(ierr);
>>>   ierr = PetscViewerFileSetName(viewer, filename); CHKERRQ(ierr);
>>>   ierr = MatLoad(*A, viewer); CHKERRQ(ierr);
>>>   ierr = PetscViewerDestroy(&viewer); CHKERRQ(ierr);
>>>   return 0;
>>> }
>>>
>>> ```
>>>
>>> These are only called once and should not affect the repeated computation in
>>> the timed loop, though.
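>>>
>>> (For reference, the timed loop is essentially repeated sparse-times-dense
>>> products; the sketch below only illustrates that shape and is not the exact
>>> benchmark code -- the function name and loop-count parameter are placeholders.)
>>>
>>> ```
>>> #include <petscmat.h>
>>>
>>> // Hypothetical sketch of the timed SpMM loop: B is the sparse matrix loaded
>>> // from disk, C the dense matrix filled by setMatToConstant().
>>> PetscErrorCode runSpMM(Mat B, Mat C, PetscInt niter) {
>>>   Mat D;
>>>   PetscErrorCode ierr;
>>>   // First call builds the product matrix (symbolic setup + numeric product).
>>>   ierr = MatMatMult(B, C, MAT_INITIAL_MATRIX, PETSC_DEFAULT, &D); CHKERRQ(ierr);
>>>   // Repeated products reuse D; these are the calls logged as MatMatMultNum.
>>>   for (PetscInt i = 0; i < niter; i++) {
>>>     ierr = MatMatMult(B, C, MAT_REUSE_MATRIX, PETSC_DEFAULT, &D); CHKERRQ(ierr);
>>>   }
>>>   ierr = MatDestroy(&D); CHKERRQ(ierr);
>>>   return 0;
>>> }
>>> ```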
>>>
>>> > But first please verify that if you run with one MPI rank the "on GPU" 
>>> > and the overall flop rates for the MatMatMult() are almost the same and 
>>> > there is no copy from the GPU for each multiply?
>>>
>>>
>>> Yes, with 1 MPI rank / GPU there are no extra copies done. As soon as I
>>> move to 2 ranks I see this behavior.
>>>
>>> Here are updated logs with a new stage for 2 ranks. I've put the computation
>>> loop in a stage named "MyComputation".
>>>
>>> ```
>>> ---------------------------------------------- PETSc Performance
>>> Summary: ----------------------------------------------
>>>
>>> /g/g15/yadav2/taco/petsc/bin/benchmark on a  named lassen572 with 2
>>> processors, by yadav2 Thu Feb  3 09:27:30 2022
>>> Using Petsc Release Version 3.16.3, unknown
>>>
>>>                          Max       Max/Min     Avg       Total
>>> Time (sec):           2.091e+02     1.001   2.090e+02
>>> Objects:              4.800e+01     1.000   4.800e+01
>>> Flop:                 4.344e+11     1.019   4.303e+11  8.606e+11
>>> Flop/sec:             2.077e+09     1.018   2.059e+09  4.118e+09
>>> MPI Messages:         3.500e+01     1.000   3.500e+01  7.000e+01
>>> MPI Message Lengths:  6.316e+10     1.000   1.805e+09  1.263e+11
>>> MPI Reductions:       8.100e+01     1.000
>>>
>>> Flop counting convention: 1 flop = 1 real number operation of type
>>> (multiply/divide/add/subtract)
>>>                             e.g., VecAXPY() for real vectors of length N
>>> --> 2N flop
>>>                             and VecAXPY() for complex vectors of length
>>> N --> 8N flop
>>>
>>> Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages
>>> ---  -- Message Lengths --  -- Reductions --
>>>                         Avg     %Total     Avg     %Total    Count
>>> %Total     Avg         %Total    Count   %Total
>>>  0:      Main Stage: 1.0555e+02  50.5%  2.8686e+11  33.3%  3.000e+01
>>>  42.9%  1.466e+09       34.8%  4.300e+01  53.1%
>>>  1:   MyComputation: 1.0345e+02  49.5%  5.7373e+11  66.7%  4.000e+01
>>>  57.1%  2.058e+09       65.2%  2.000e+01  24.7%
>>>
>>>
>>> ------------------------------------------------------------------------------------------------------------------------
>>> See the 'Profiling' chapter of the users' manual for details on
>>> interpreting output.
>>> Phase summary info:
>>>    Count: number of times phase was executed
>>>    Time and Flop: Max - maximum over all processors
>>>                   Ratio - ratio of maximum to minimum over all processors
>>>    Mess: number of messages sent
>>>    AvgLen: average message length (bytes)
>>>    Reduct: number of global reductions
>>>    Global: entire computation
>>>    Stage: stages of a computation. Set stages with PetscLogStagePush()
>>> and PetscLogStagePop().
>>>       %T - percent time in this phase         %F - percent flop in this
>>> phase
>>>       %M - percent messages in this phase     %L - percent message
>>> lengths in this phase
>>>       %R - percent reductions in this phase
>>>    Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time
>>> over all processors)
>>>    GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max
>>> GPU time over all processors)
>>>    CpuToGpu Count: total number of CPU to GPU copies per processor
>>>    CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per
>>> processor)
>>>    GpuToCpu Count: total number of GPU to CPU copies per processor
>>>    GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per
>>> processor)
>>>    GPU %F: percent flops on GPU in this event
>>>
>>> ------------------------------------------------------------------------------------------------------------------------
>>> Event                Count      Time (sec)     Flop
>>>          --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   -
>>> GpuToCpu - GPU
>>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen
>>>  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size
>>> Count   Size  %F
>>>
>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>
>>> --- Event Stage 0: Main Stage
>>>
>>> BuildTwoSided          2 1.0 4.0085e-0136.3 0.00e+00 0.0 2.0e+00 4.0e+00
>>> 2.0e+00  0  0  3  0  2   0  0  7  0  5     0       0      0 0.00e+00    0
>>> 0.00e+00  0
>>> BuildTwoSidedF         1 1.0 4.0080e-0113602.0 0.00e+00 0.0 0.0e+00
>>> 0.0e+00 1.0e+00  0  0  0  0  1   0  0  0  0  2     0       0      0
>>> 0.00e+00    0 0.00e+00  0
>>> MatAssemblyBegin      12 1.0 4.0084e-017217.1 0.00e+00 0.0 0.0e+00
>>> 0.0e+00 1.0e+00  0  0  0  0  1   0  0  0  0  2     0       0      0
>>> 0.00e+00    0 0.00e+00  0
>>> MatAssemblyEnd        12 1.0 3.4970e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 6.0e+00  2  0  0  0  7   3  0  0  0 14     0       0      0 0.00e+00    0
>>> 0.00e+00  0
>>> MatZeroEntries         1 1.0 2.4093e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0
>>> 0.00e+00  0
>>> MatLoad                1 1.0 1.3756e+01 1.0 0.00e+00 0.0 6.0e+00 4.6e+08
>>> 2.1e+01  7  0  9  2 26  13  0 20  6 49     0       0      0 0.00e+00    0
>>> 0.00e+00  0
>>> MatMatMultSym         20 1.0 4.7919e+00 2.4 0.00e+00 0.0 4.0e+00 1.6e+07
>>> 1.2e+01  2  0  6  0 15   3  0 13  0 28     0       0      0 0.00e+00    0
>>> 0.00e+00  0
>>> MatMatMultNum         10 1.0 4.9853e+01 1.1 1.45e+11 1.0 2.0e+01 2.1e+09
>>> 0.0e+00 23 33 29 33  0  46100 67 94  0  5754   182686      2 2.23e+03   10
>>> 2.08e+04  5
>>> MatCUSPARSCopyTo       1 1.0 2.2646e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      1 1.55e+02    0
>>> 0.00e+00  0
>>> MatDenseCopyTo         1 1.0 1.6636e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      1 2.08e+03    0
>>> 0.00e+00  0
>>> MatDenseCopyFrom      11 1.0 3.0463e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00  1  0  0  0  0   3  0  0  0  0     0       0      0 0.00e+00   11
>>> 2.29e+04  0
>>> VecSet                 3 1.0 5.0035e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0
>>> 0.00e+00  0
>>> SFSetGraph             1 1.0 4.4294e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0
>>> 0.00e+00  0
>>> SFSetUp                1 1.0 1.3982e-01 1.0 0.00e+00 0.0 4.0e+00 1.6e+07
>>> 1.0e+00  0  0  6  0  1   0  0 13  0  2     0       0      0 0.00e+00    0
>>> 0.00e+00  0
>>>
>>> --- Event Stage 1: MyComputation
>>>
>>> MatAssemblyBegin      20 1.0 1.6894e-05 2.7 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0
>>> 0.00e+00  0
>>> MatAssemblyEnd        20 1.0 1.5575e-05 1.5 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0
>>> 0.00e+00  0
>>> MatMatMultSym         40 1.0 1.0096e+01 2.6 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 2.0e+01  3  0  0  0 25   7  0  0  0100     0       0      0 0.00e+00    0
>>> 0.00e+00  0
>>> MatMatMultNum         20 1.0 9.9320e+01 1.1 2.90e+11 1.0 4.0e+01 2.1e+09
>>> 0.0e+00 46 67 57 65  0  93100100100  0  5777   182577      0 0.00e+00   20
>>> 4.16e+04  5
>>> MatDenseCopyFrom      20 1.0 5.5380e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00  3  0  0  0  0   5  0  0  0  0     0       0      0 0.00e+00   20
>>> 4.16e+04  0
>>>
>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>
>>> Memory usage is given in bytes:
>>>
>>> Object Type          Creations   Destructions     Memory  Descendants'
>>> Mem.
>>> Reports information only for process 0.
>>>
>>> --- Event Stage 0: Main Stage
>>>
>>>               Matrix    17             10  20381695840     0.
>>>               Viewer     2              0            0     0.
>>>               Vector     4              1         1792     0.
>>>            Index Set     2              2     31848152     0.
>>>    Star Forest Graph     3              0            0     0.
>>>
>>> --- Event Stage 1: MyComputation
>>>
>>>               Matrix    20             20  40763391680     0.
>>>
>>> ========================================================================================================================
>>> Average time to get PetscTime(): 3.96e-08
>>> Average time for MPI_Barrier(): 8.184e-07
>>> Average time for zero size MPI_Send(): 2.8165e-06
>>> #PETSc Option Table entries:
>>> -bench spmm
>>> -enable_gpu
>>> -log_view
>>> -mat_type aijcusparse
>>> -matload_block_size 1
>>> -matrix /p/gpfs1/yadav2/tensors/petsc/nlpkkt200.petsc
>>> -n 20
>>> -vec_type cuda
>>> -warmup 10
>>> #End of PETSc Option Table entries
>>> Compiled without FORTRAN kernels
>>> Compiled with full precision matrices (default)
>>> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
>>> sizeof(PetscScalar) 8 sizeof(PetscInt) 4
>>> Configure options: --download-c2html=0 --download-hwloc=0
>>> --download-sowing=0 --prefix=./petsc-install/ --with-64-bit-indices=0
>>> --with-blaslapack-lib="/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib/liblapack.so
>>> /usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib/libblas.so"
>>> --with-cc=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc
>>> --with-clanguage=C --with-cxx-dialect=C++17
>>> --with-cxx=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpig++
>>> --with-cuda=1 --with-debugging=0
>>> --with-fc=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran
>>> --with-fftw=0
>>> --with-hdf5-dir=/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4
>>> --with-hdf5=1 --with-mumps=0 --with-precision=double --with-scalapack=0
>>> --with-scalar-type=real --with-shared-libraries=1 --with-ssl=0
>>> --with-suitesparse=0 --with-trilinos=0 --with-valgrind=0 --with-x=0
>>> --with-zlib-include=/usr/include --with-zlib-lib=/usr/lib64/libz.so
>>> --with-zlib=1 CFLAGS="-g -DNoChange" COPTFLAGS="-O3" CXXFLAGS="-O3"
>>> CXXOPTFLAGS="-O3" FFLAGS=-g CUDAFLAGS=-std=c++17 FOPTFLAGS=
>>> PETSC_ARCH=arch-linux-c-opt
>>> -----------------------------------------
>>> Libraries compiled on 2022-01-21 06:41:50 on lassen111
>>> Machine characteristics:
>>> Linux-4.14.0-115.21.2.1chaos.ch6a.ppc64le-ppc64le-with-redhat-7.6-Maipo
>>> Using PETSc directory: /g/g15/yadav2/taco/petsc/petsc/petsc-install
>>> Using PETSc arch:
>>> -----------------------------------------
>>>
>>> Using C compiler:
>>> /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc
>>> -g -DNoChange -fPIC "-O3"
>>> Using Fortran compiler:
>>> /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran
>>> -g -fPIC
>>> -----------------------------------------
>>>
>>> Using include paths:
>>> -I/g/g15/yadav2/taco/petsc/petsc/petsc-install/include
>>> -I/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/include
>>> -I/usr/include -I/usr/tce/packages/cuda/cuda-11.1.0/include
>>> -----------------------------------------
>>>
>>> Using C linker:
>>> /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc
>>> Using Fortran linker:
>>> /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran
>>> Using libraries:
>>> -Wl,-rpath,/g/g15/yadav2/taco/petsc/petsc/petsc-install/lib
>>> -L/g/g15/yadav2/taco/petsc/petsc/petsc-install/lib -lpetsc
>>> -Wl,-rpath,/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib
>>> -L/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib
>>> -Wl,-rpath,/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/lib
>>> -L/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/lib
>>> -Wl,-rpath,/usr/tce/packages/cuda/cuda-11.1.0/lib64
>>> -L/usr/tce/packages/cuda/cuda-11.1.0/lib64
>>> -Wl,-rpath,/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib
>>> -L/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib
>>> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc/ppc64le-redhat-linux/8
>>> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc/ppc64le-redhat-linux/8
>>> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc
>>> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc
>>> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib64
>>> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib64
>>> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib
>>> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib -llapack -lblas -lhdf5_hl
>>> -lhdf5 -lm /usr/lib64/libz.so -lcuda -lcudart -lcufft -lcublas -lcusparse
>>> -lcusolver -lcurand -lstdc++ -ldl -lmpiprofilesupport -lmpi_ibm_usempi
>>> -lmpi_ibm_mpifh -lmpi_ibm -lgfortran -lm -lgfortran -lm -lgcc_s -lquadmath
>>> -lpthread -lquadmath -lstdc++ -ldl
>>> -----------------------------------------
>>> ```
>>>
>>> On Wed, Feb 2, 2022 at 11:59 PM Stefano Zampini <
>>> [email protected]> wrote:
>>>
>>>>
>>>>
>>>> 1) It uses MatMPIDenseScatter() to move to the other ranks their
>>>>> needed rows of the C matrix. That function has a call to
>>>>> MatDenseGetArrayRead() that normally would trigger a copy of C up to the
>>>>> CPU each time. But since C is not changing in your test run I guess it
>>>>> only triggers one copy.
>>>>>
>>>>> 2) It uses
>>>>> MatMatMultNumericAdd_SeqAIJ_SeqDense(aij->B,workB,cdense->A,PETSC_TRUE);CHKERRQ(ierr);
>>>>> to do the off-diagonal part of the product, but this triggers for each
>>>>> multiply a copy of the result matrix from the CPU to the GPU (hugely
>>>>> expensive).
>>>>>
>>>>> For performance there needs to be a new routine
>>>>> MatMatMultNumeric_MPIAIJCUSPARSE_MPICUDADense() that is smarter about the
>>>>> needed MPI communication, so it only moves exactly what it needs to the
>>>>> other ranks and does the off-diagonal part of the product on the GPU, so
>>>>> it does not need to copy the result up to the CPU.
>>>>>
>>>>>
>>>> MPIAIJCUSPARSE uses MatProductSetFromOptions_MPIAIJBACKEND
>>>>
>>>> Rohan
>>>> I would suggest adding a PetscLogStage around your performance loop (do a
>>>> warmup outside of it) and sending the relevant portion of the log.
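>>>> Something along these lines (a minimal sketch; the stage name and the
>>>> matrix/loop-count variables are placeholders, not your actual code):
>>>>
>>>> ```
>>>> // B, C, D are the sparse, dense, and result matrices created earlier.
>>>> PetscLogStage stage;
>>>> PetscLogStageRegister("MyComputation", &stage);
>>>>
>>>> // Warmup iterations stay outside the stage, so they are logged in the
>>>> // Main Stage only.
>>>> for (PetscInt i = 0; i < nwarmup; i++) {
>>>>   MatMatMult(B, C, MAT_REUSE_MATRIX, PETSC_DEFAULT, &D);
>>>> }
>>>>
>>>> // Only the timed iterations are attributed to the new stage.
>>>> PetscLogStagePush(stage);
>>>> for (PetscInt i = 0; i < niter; i++) {
>>>>   MatMatMult(B, C, MAT_REUSE_MATRIX, PETSC_DEFAULT, &D);
>>>> }
>>>> PetscLogStagePop();
>>>> ```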
>>>>
>>>>
>>>>> Barry
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> ---------------------------------------------- PETSc Performance Summary: 
>>>>> ----------------------------------------------
>>>>>
>>>>> /g/g15/yadav2/taco/petsc/bin/benchmark on a  named lassen457 with 2 
>>>>> processors, by yadav2 Wed Feb  2 17:23:19 2022
>>>>> Using Petsc Release Version 3.16.3, unknown
>>>>>
>>>>>                          Max       Max/Min     Avg       Total
>>>>> Time (sec):           1.163e+02     1.000   1.163e+02
>>>>> Objects:              4.800e+01     1.000   4.800e+01
>>>>> Flop:                 6.338e+11     1.065   6.144e+11  1.229e+12
>>>>> Flop/sec:             5.451e+09     1.065   5.284e+09  1.057e+10
>>>>> MPI Messages:         3.500e+01     1.000   3.500e+01  7.000e+01
>>>>> MPI Message Lengths:  2.544e+09     1.000   7.267e+07  5.087e+09
>>>>> MPI Reductions:       8.100e+01     1.000
>>>>>
>>>>> Flop counting convention: 1 flop = 1 real number operation of type 
>>>>> (multiply/divide/add/subtract)
>>>>>                             e.g., VecAXPY() for real vectors of length N 
>>>>> --> 2N flop
>>>>>                             and VecAXPY() for complex vectors of length N 
>>>>> --> 8N flop
>>>>>
>>>>> Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages 
>>>>> ---  -- Message Lengths --  -- Reductions --
>>>>>                         Avg     %Total     Avg     %Total    Count   
>>>>> %Total     Avg         %Total    Count   %Total
>>>>>  0:      Main Stage: 1.1628e+02 100.0%  1.2288e+12 100.0%  7.000e+01 
>>>>> 100.0%  7.267e+07      100.0%  6.300e+01  77.8%
>>>>>
>>>>> ------------------------------------------------------------------------------------------------------------------------
>>>>> See the 'Profiling' chapter of the users' manual for details on 
>>>>> interpreting output.
>>>>> Phase summary info:
>>>>>    Count: number of times phase was executed
>>>>>    Time and Flop: Max - maximum over all processors
>>>>>                   Ratio - ratio of maximum to minimum over all processors
>>>>>    Mess: number of messages sent
>>>>>    AvgLen: average message length (bytes)
>>>>>    Reduct: number of global reductions
>>>>>    Global: entire computation
>>>>>    Stage: stages of a computation. Set stages with PetscLogStagePush() 
>>>>> and PetscLogStagePop().
>>>>>       %T - percent time in this phase         %F - percent flop in this 
>>>>> phase
>>>>>       %M - percent messages in this phase     %L - percent message 
>>>>> lengths in this phase
>>>>>       %R - percent reductions in this phase
>>>>>    Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time 
>>>>> over all processors)
>>>>>    GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU 
>>>>> time over all processors)
>>>>>    CpuToGpu Count: total number of CPU to GPU copies per processor
>>>>>    CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per 
>>>>> processor)
>>>>>    GpuToCpu Count: total number of GPU to CPU copies per processor
>>>>>    GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per 
>>>>> processor)
>>>>>    GPU %F: percent flops on GPU in this event
>>>>> ------------------------------------------------------------------------------------------------------------------------
>>>>> Event                Count      Time (sec)     Flop                       
>>>>>        --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - 
>>>>> GpuToCpu - GPU
>>>>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  
>>>>> Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   
>>>>> Count   Size  %F
>>>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>>>
>>>>> --- Event Stage 0: Main Stage
>>>>>
>>>>> BuildTwoSided          2 1.0 4.4400e-01567.5 0.00e+00 0.0 2.0e+00 4.0e+00 
>>>>> 2.0e+00  0  0  3  0  2   0  0  3  0  3     0       0      0 0.00e+00    0 
>>>>> 0.00e+00  0
>>>>> BuildTwoSidedF         1 1.0 4.4395e-0115659.1 0.00e+00 0.0 0.0e+00 
>>>>> 0.0e+00 1.0e+00  0  0  0  0  1   0  0  0  0  2     0       0      0 
>>>>> 0.00e+00    0 0.00e+00  0
>>>>> MatAssemblyBegin      32 1.0 4.4400e-017378.9 0.00e+00 0.0 0.0e+00 
>>>>> 0.0e+00 1.0e+00  0  0  0  0  1   0  0  0  0  2     0       0      0 
>>>>> 0.00e+00    0 0.00e+00  0
>>>>> MatAssemblyEnd        32 1.0 1.8511e+00 2.2 0.00e+00 0.0 0.0e+00 0.0e+00 
>>>>> 6.0e+00  1  0  0  0  7   1  0  0  0 10     0       0      0 0.00e+00    0 
>>>>> 0.00e+00  0
>>>>> MatZeroEntries         1 1.0 3.3306e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 
>>>>> 0.00e+00  0
>>>>> MatLoad                1 1.0 1.7220e+01 1.0 0.00e+00 0.0 6.0e+00 -8.8e+07 
>>>>> 2.1e+01 15  0  9-10 26  15  0  9-10 33     0       0      0 0.00e+00    0 
>>>>> 0.00e+00  0
>>>>> MatMatMultSym         60 1.0 9.2215e-01 2.6 0.00e+00 0.0 4.0e+00 7.3e+05 
>>>>> 3.2e+01  1  0  6  0 40   1  0  6  0 51     0       0      0 0.00e+00    0 
>>>>> 0.00e+00  0
>>>>> MatMatMultNum         30 1.0 4.2967e+01 1.0 6.34e+11 1.1 6.0e+01 9.4e+07 
>>>>> 0.0e+00 37100 86110  0  37100 86110  0 28598   920026      2 6.71e+03   
>>>>> 30 8.73e+04 98
>>>>> MatCUSPARSCopyTo       1 1.0 4.4761e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      1 3.80e+03    0 
>>>>> 0.00e+00  0
>>>>> MatDenseCopyTo         1 1.0 2.2742e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      1 2.91e+03    0 
>>>>> 0.00e+00  0
>>>>> MatDenseCopyFrom      31 1.0 1.2006e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>>>> 0.0e+00 10  0  0  0  0  10  0  0  0  0     0       0      0 0.00e+00   31 
>>>>> 9.02e+04  0
>>>>> VecSet                 3 1.0 4.1917e-04 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 
>>>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 
>>>>> 0.00e+00  0
>>>>> SFSetGraph             1 1.0 1.9180e-04 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 
>>>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 
>>>>> 0.00e+00  0
>>>>> SFSetUp                1 1.0 1.3672e-02 1.1 0.00e+00 0.0 4.0e+00 7.3e+05 
>>>>> 1.0e+00  0  0  6  0  1   0  0  6  0  2     0       0      0 0.00e+00    0 
>>>>> 0.00e+00  0
>>>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>>>
>>>>> Memory usage is given in bytes:
>>>>>
>>>>> Object Type          Creations   Destructions     Memory  Descendants' 
>>>>> Mem.
>>>>> Reports information only for process 0.
>>>>>
>>>>> --- Event Stage 0: Main Stage
>>>>>
>>>>>               Matrix    37             30   2867511840     0.
>>>>>               Viewer     2              0            0     0.
>>>>>               Vector     4              1         1792     0.
>>>>>            Index Set     2              2      1495248     0.
>>>>>    Star Forest Graph     3              0            0     0.
>>>>> ========================================================================================================================
>>>>> Average time to get PetscTime(): 3.83e-08
>>>>> Average time for MPI_Barrier(): 7.874e-07
>>>>> Average time for zero size MPI_Send(): 3.4035e-06
>>>>> #PETSc Option Table entries:
>>>>> -bench spmm
>>>>> -enable_gpu
>>>>> -log_view
>>>>> -mat_type aijcusparse
>>>>> -matload_block_size 1
>>>>> -matrix /p/gpfs1/yadav2/tensors/petsc/arabic-2005.petsc
>>>>> -n 20
>>>>> -vec_type cuda
>>>>> -warmup 10
>>>>> ```
>>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>>
>>>>> Rohan Yadav
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>> Stefano
>>>>
>>>
>>>
>>
>
