Mark,
Good eye. Something is definitely very different between this run and the
previous one (different options, a code change?). In the previously sent runs
it was about 98% on the GPU.
Barry
> On Feb 3, 2022, at 12:29 PM, Rohan Yadav <[email protected]> wrote:
>
> > Please send the code that builds the sparse B matrix and the
> > setMatToConstant() routine.
>
> Setting to a constant:
> ```
> void setMatToConstant(Mat mat, PetscScalar c) {
>   PetscInt rStart, rEnd, m, n;
>   MatGetSize(mat, &m, &n);
>   MatGetOwnershipRange(mat, &rStart, &rEnd);
>   // Insert the constant into every column of each locally owned row.
>   for (PetscInt i = rStart; i < rEnd; i++) {
>     for (PetscInt j = 0; j < n; j++) {
>       MatSetValue(mat, i, j, c, INSERT_VALUES);
>     }
>   }
>   MatAssemblyBegin(mat, MAT_FINAL_ASSEMBLY);
>   MatAssemblyEnd(mat, MAT_FINAL_ASSEMBLY);
> }
> ```
>
> Loading sparse matrix from disk:
> ```
> int loadMatrixFromFile(Mat* A, char* filename) {
>   PetscErrorCode ierr;
>   ierr = MatCreate(PETSC_COMM_WORLD, A); CHKERRQ(ierr);
>   ierr = MatSetFromOptions(*A); CHKERRQ(ierr);
>   PetscViewer viewer;
>   ierr = PetscViewerCreate(PETSC_COMM_WORLD, &viewer); CHKERRQ(ierr);
>   ierr = PetscViewerSetType(viewer, PETSCVIEWERBINARY); CHKERRQ(ierr);
>   ierr = PetscViewerFileSetMode(viewer, FILE_MODE_READ); CHKERRQ(ierr);
>   ierr = PetscViewerFileSetName(viewer, filename); CHKERRQ(ierr);
>   ierr = MatLoad(*A, viewer); CHKERRQ(ierr);
>   // Free the viewer once the matrix has been loaded.
>   ierr = PetscViewerDestroy(&viewer); CHKERRQ(ierr);
>   return 0;
> }
> ```
> These are only called once, though, and should not affect the computation in
> the loop.
> > But first please verify that if you run with one MPI rank the "on GPU" and
> > the overall flop rates for the MatMatMult() are almost the same and there
> > is no copy from the GPU for each multiply?
>
> Yes, with 1 MPI rank / GPU there are no extra copies. As soon as I move to 2
> ranks I see this behavior.
>
> Here are updated logs with a new stage for 2 ranks. I've staged the logs into
> "MyComputation".
>
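> For reference, a minimal sketch of how the stage is set up around the timed
> loop (the matrix names, loop bounds, and use of MatMatMult here are
> illustrative placeholders for the benchmark code, with `warmup` and `n`
> corresponding to the -warmup/-n options):
>
> ```
> PetscLogStage stage;
> PetscLogStageRegister("MyComputation", &stage);
>
> // Warmup multiplies run outside the logged stage; the first call creates the result.
> Mat result = NULL;
> for (int i = 0; i < warmup; i++) {
>   MatMatMult(B, C, i == 0 ? MAT_INITIAL_MATRIX : MAT_REUSE_MATRIX, PETSC_DEFAULT, &result);
> }
>
> // Only the timed multiplies are attributed to the "MyComputation" stage by -log_view.
> PetscLogStagePush(stage);
> for (int i = 0; i < n; i++) {
>   MatMatMult(B, C, MAT_REUSE_MATRIX, PETSC_DEFAULT, &result);
> }
> PetscLogStagePop();
> ```
>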
> ```
> ---------------------------------------------- PETSc Performance Summary:
> ----------------------------------------------
>
> /g/g15/yadav2/taco/petsc/bin/benchmark on a named lassen572 with 2
> processors, by yadav2 Thu Feb 3 09:27:30 2022
> Using Petsc Release Version 3.16.3, unknown
>
> Max Max/Min Avg Total
> Time (sec): 2.091e+02 1.001 2.090e+02
> Objects: 4.800e+01 1.000 4.800e+01
> Flop: 4.344e+11 1.019 4.303e+11 8.606e+11
> Flop/sec: 2.077e+09 1.018 2.059e+09 4.118e+09
> MPI Messages: 3.500e+01 1.000 3.500e+01 7.000e+01
> MPI Message Lengths: 6.316e+10 1.000 1.805e+09 1.263e+11
> MPI Reductions: 8.100e+01 1.000
>
> Flop counting convention: 1 flop = 1 real number operation of type
> (multiply/divide/add/subtract)
> e.g., VecAXPY() for real vectors of length N -->
> 2N flop
> and VecAXPY() for complex vectors of length N -->
> 8N flop
>
> Summary of Stages: ----- Time ------ ----- Flop ------ --- Messages ---
> -- Message Lengths -- -- Reductions --
> Avg %Total Avg %Total Count %Total
> Avg %Total Count %Total
> 0: Main Stage: 1.0555e+02 50.5% 2.8686e+11 33.3% 3.000e+01 42.9%
> 1.466e+09 34.8% 4.300e+01 53.1%
> 1: MyComputation: 1.0345e+02 49.5% 5.7373e+11 66.7% 4.000e+01 57.1%
> 2.058e+09 65.2% 2.000e+01 24.7%
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on interpreting
> output.
> Phase summary info:
> Count: number of times phase was executed
> Time and Flop: Max - maximum over all processors
> Ratio - ratio of maximum to minimum over all processors
> Mess: number of messages sent
> AvgLen: average message length (bytes)
> Reduct: number of global reductions
> Global: entire computation
> Stage: stages of a computation. Set stages with PetscLogStagePush() and
> PetscLogStagePop().
> %T - percent time in this phase %F - percent flop in this phase
> %M - percent messages in this phase %L - percent message lengths in
> this phase
> %R - percent reductions in this phase
> Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over
> all processors)
> GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU
> time over all processors)
> CpuToGpu Count: total number of CPU to GPU copies per processor
> CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per
> processor)
> GpuToCpu Count: total number of GPU to CPU copies per processor
> GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per
> processor)
> GPU %F: percent flops on GPU in this event
> ------------------------------------------------------------------------------------------------------------------------
> Event Count Time (sec) Flop
> --- Global --- --- Stage ---- Total GPU - CpuToGpu - - GpuToCpu -
> GPU
> Max Ratio Max Ratio Max Ratio Mess AvgLen
> Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s Mflop/s Count Size Count
> Size %F
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> BuildTwoSided 2 1.0 4.0085e-0136.3 0.00e+00 0.0 2.0e+00 4.0e+00
> 2.0e+00 0 0 3 0 2 0 0 7 0 5 0 0 0 0.00e+00 0
> 0.00e+00 0
> BuildTwoSidedF 1 1.0 4.0080e-0113602.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 1.0e+00 0 0 0 0 1 0 0 0 0 2 0 0 0 0.00e+00 0
> 0.00e+00 0
> MatAssemblyBegin 12 1.0 4.0084e-017217.1 0.00e+00 0.0 0.0e+00 0.0e+00
> 1.0e+00 0 0 0 0 1 0 0 0 0 2 0 0 0 0.00e+00 0
> 0.00e+00 0
> MatAssemblyEnd 12 1.0 3.4970e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 6.0e+00 2 0 0 0 7 3 0 0 0 14 0 0 0 0.00e+00 0
> 0.00e+00 0
> MatZeroEntries 1 1.0 2.4093e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0
> 0.00e+00 0
> MatLoad 1 1.0 1.3756e+01 1.0 0.00e+00 0.0 6.0e+00 4.6e+08
> 2.1e+01 7 0 9 2 26 13 0 20 6 49 0 0 0 0.00e+00 0
> 0.00e+00 0
> MatMatMultSym 20 1.0 4.7919e+00 2.4 0.00e+00 0.0 4.0e+00 1.6e+07
> 1.2e+01 2 0 6 0 15 3 0 13 0 28 0 0 0 0.00e+00 0
> 0.00e+00 0
> MatMatMultNum 10 1.0 4.9853e+01 1.1 1.45e+11 1.0 2.0e+01 2.1e+09
> 0.0e+00 23 33 29 33 0 46100 67 94 0 5754 182686 2 2.23e+03 10
> 2.08e+04 5
> MatCUSPARSCopyTo 1 1.0 2.2646e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 1 1.55e+02 0
> 0.00e+00 0
> MatDenseCopyTo 1 1.0 1.6636e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 1 2.08e+03 0
> 0.00e+00 0
> MatDenseCopyFrom 11 1.0 3.0463e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 1 0 0 0 0 3 0 0 0 0 0 0 0 0.00e+00 11
> 2.29e+04 0
> VecSet 3 1.0 5.0035e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0
> 0.00e+00 0
> SFSetGraph 1 1.0 4.4294e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0
> 0.00e+00 0
> SFSetUp 1 1.0 1.3982e-01 1.0 0.00e+00 0.0 4.0e+00 1.6e+07
> 1.0e+00 0 0 6 0 1 0 0 13 0 2 0 0 0 0.00e+00 0
> 0.00e+00 0
>
> --- Event Stage 1: MyComputation
>
> MatAssemblyBegin 20 1.0 1.6894e-05 2.7 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0
> 0.00e+00 0
> MatAssemblyEnd 20 1.0 1.5575e-05 1.5 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0
> 0.00e+00 0
> MatMatMultSym 40 1.0 1.0096e+01 2.6 0.00e+00 0.0 0.0e+00 0.0e+00
> 2.0e+01 3 0 0 0 25 7 0 0 0100 0 0 0 0.00e+00 0
> 0.00e+00 0
> MatMatMultNum 20 1.0 9.9320e+01 1.1 2.90e+11 1.0 4.0e+01 2.1e+09
> 0.0e+00 46 67 57 65 0 93100100100 0 5777 182577 0 0.00e+00 20
> 4.16e+04 5
> MatDenseCopyFrom 20 1.0 5.5380e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00 3 0 0 0 0 5 0 0 0 0 0 0 0 0.00e+00 20
> 4.16e+04 0
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Memory usage is given in bytes:
>
> Object Type Creations Destructions Memory Descendants' Mem.
> Reports information only for process 0.
>
> --- Event Stage 0: Main Stage
>
> Matrix 17 10 20381695840 0.
> Viewer 2 0 0 0.
> Vector 4 1 1792 0.
> Index Set 2 2 31848152 0.
> Star Forest Graph 3 0 0 0.
>
> --- Event Stage 1: MyComputation
>
> Matrix 20 20 40763391680 0.
> ========================================================================================================================
> Average time to get PetscTime(): 3.96e-08
> Average time for MPI_Barrier(): 8.184e-07
> Average time for zero size MPI_Send(): 2.8165e-06
> #PETSc Option Table entries:
> -bench spmm
> -enable_gpu
> -log_view
> -mat_type aijcusparse
> -matload_block_size 1
> -matrix /p/gpfs1/yadav2/tensors/petsc/nlpkkt200.petsc
> -n 20
> -vec_type cuda
> -warmup 10
> #End of PETSc Option Table entries
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8
> sizeof(PetscScalar) 8 sizeof(PetscInt) 4
> Configure options: --download-c2html=0 --download-hwloc=0 --download-sowing=0
> --prefix=./petsc-install/ --with-64-bit-indices=0
> --with-blaslapack-lib="/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib/liblapack.so
> /usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib/libblas.so"
> --with-cc=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc
> --with-clanguage=C --with-cxx-dialect=C++17
> --with-cxx=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpig++
> --with-cuda=1 --with-debugging=0
> --with-fc=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran
> --with-fftw=0
> --with-hdf5-dir=/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4
> --with-hdf5=1 --with-mumps=0 --with-precision=double --with-scalapack=0
> --with-scalar-type=real --with-shared-libraries=1 --with-ssl=0
> --with-suitesparse=0 --with-trilinos=0 --with-valgrind=0 --with-x=0
> --with-zlib-include=/usr/include --with-zlib-lib=/usr/lib64/libz.so
> --with-zlib=1 CFLAGS="-g -DNoChange" COPTFLAGS="-O3" CXXFLAGS="-O3"
> CXXOPTFLAGS="-O3" FFLAGS=-g CUDAFLAGS=-std=c++17 FOPTFLAGS=
> PETSC_ARCH=arch-linux-c-opt
> -----------------------------------------
> Libraries compiled on 2022-01-21 06:41:50 on lassen111
> Machine characteristics:
> Linux-4.14.0-115.21.2.1chaos.ch6a.ppc64le-ppc64le-with-redhat-7.6-Maipo
> Using PETSc directory: /g/g15/yadav2/taco/petsc/petsc/petsc-install
> Using PETSc arch:
> -----------------------------------------
>
> Using C compiler:
> /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc
> -g -DNoChange -fPIC "-O3"
> Using Fortran compiler:
> /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran
> -g -fPIC
> -----------------------------------------
>
> Using include paths: -I/g/g15/yadav2/taco/petsc/petsc/petsc-install/include
> -I/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/include
> -I/usr/include -I/usr/tce/packages/cuda/cuda-11.1.0/include
> -----------------------------------------
>
> Using C linker:
> /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc
> Using Fortran linker:
> /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran
> Using libraries: -Wl,-rpath,/g/g15/yadav2/taco/petsc/petsc/petsc-install/lib
> -L/g/g15/yadav2/taco/petsc/petsc/petsc-install/lib -lpetsc
> -Wl,-rpath,/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib
> -L/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib
> -Wl,-rpath,/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/lib
> -L/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/lib
> -Wl,-rpath,/usr/tce/packages/cuda/cuda-11.1.0/lib64
> -L/usr/tce/packages/cuda/cuda-11.1.0/lib64
> -Wl,-rpath,/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib
> -L/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib
> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc/ppc64le-redhat-linux/8
> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc/ppc64le-redhat-linux/8
> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc
> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc
> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib64
> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib64
> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib
> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib -llapack -lblas -lhdf5_hl -lhdf5
> -lm /usr/lib64/libz.so -lcuda -lcudart -lcufft -lcublas -lcusparse -lcusolver
> -lcurand -lstdc++ -ldl -lmpiprofilesupport -lmpi_ibm_usempi -lmpi_ibm_mpifh
> -lmpi_ibm -lgfortran -lm -lgfortran -lm -lgcc_s -lquadmath -lpthread
> -lquadmath -lstdc++ -ldl
> -----------------------------------------
> ```
>
> On Wed, Feb 2, 2022 at 11:59 PM Stefano Zampini <[email protected]> wrote:
>
>
> 1) It uses MatMPIDenseScatter() to move the rows of the C matrix that the
> other ranks need. That function calls MatDenseGetArrayRead(), which would
> normally trigger a copy of C up to the CPU each time. But since C is not
> changing in your test run, I guess it only triggers one copy.
>
> 2) It uses
> MatMatMultNumericAdd_SeqAIJ_SeqDense(aij->B,workB,cdense->A,PETSC_TRUE);CHKERRQ(ierr);
> to do the off-diagonal part of the product, but this triggers, for each
> multiply, a copy of the result matrix from the CPU to the GPU (hugely
> expensive).
>
> For performance there needs to be a new routine,
> MatMatMultNumeric_MPIAIJCUSPARSE_MPICUDADense(), that is smarter about the
> needed MPI communication, so it only moves exactly what it needs to the other
> ranks and does the off-diagonal part of the product on the GPU, so it does
> not need to copy the result up to the CPU.
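>
> To make the shape of this concrete, here is a rough, non-PETSc sketch of how
> the per-rank product decomposes (plain CSR/dense arrays with made-up names;
> PETSc's actual data structures and code paths differ):
>
> ```
> /* Rough sketch, not PETSc source: each rank owns a "diagonal" sparse block
>  * (columns it owns) and an "off-diagonal" block (columns owned by others).
>  * The diagonal part multiplies the locally owned rows of the dense C; the
>  * off-diagonal part multiplies the rows of C gathered from other ranks, and
>  * that is the piece currently routed through the CPU. */
> typedef struct {
>   int           nrows;
>   const int    *rowptr;  /* CSR row pointers   */
>   const int    *colidx;  /* CSR column indices */
>   const double *vals;    /* CSR values         */
> } CSRBlock;
>
> /* result += block * C, with C and result stored row-major with ldc columns. */
> static void spmm_add(const CSRBlock *blk, const double *C, int ldc, double *result) {
>   for (int i = 0; i < blk->nrows; i++)
>     for (int k = blk->rowptr[i]; k < blk->rowptr[i + 1]; k++)
>       for (int j = 0; j < ldc; j++)
>         result[i * ldc + j] += blk->vals[k] * C[blk->colidx[k] * ldc + j];
> }
>
> static void rank_local_spmm(const CSRBlock *A_diag, const CSRBlock *A_off,
>                             const double *C_local, const double *C_gathered,
>                             int ldc, double *result) {
>   spmm_add(A_diag, C_local, ldc, result);   /* can stay on the GPU              */
>   spmm_add(A_off, C_gathered, ldc, result); /* the part that forces the copies  */
> }
> ```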
>
>
> MPIAIJCUSPARSE uses MatProductSetFromOptions_MPIAIJBACKEND
>
> Rohan
> I would suggest adding a PetscLogStage around your performance loop (do a
> warmup outside of it) and sending the relevant portion of the log.
>
> Barry
>
>
>
>
>
>
>> ```
>> ---------------------------------------------- PETSc Performance Summary:
>> ----------------------------------------------
>>
>> /g/g15/yadav2/taco/petsc/bin/benchmark on a named lassen457 with 2
>> processors, by yadav2 Wed Feb 2 17:23:19 2022
>> Using Petsc Release Version 3.16.3, unknown
>>
>> Max Max/Min Avg Total
>> Time (sec): 1.163e+02 1.000 1.163e+02
>> Objects: 4.800e+01 1.000 4.800e+01
>> Flop: 6.338e+11 1.065 6.144e+11 1.229e+12
>> Flop/sec: 5.451e+09 1.065 5.284e+09 1.057e+10
>> MPI Messages: 3.500e+01 1.000 3.500e+01 7.000e+01
>> MPI Message Lengths: 2.544e+09 1.000 7.267e+07 5.087e+09
>> MPI Reductions: 8.100e+01 1.000
>>
>> Flop counting convention: 1 flop = 1 real number operation of type
>> (multiply/divide/add/subtract)
>> e.g., VecAXPY() for real vectors of length N -->
>> 2N flop
>> and VecAXPY() for complex vectors of length N
>> --> 8N flop
>>
>> Summary of Stages: ----- Time ------ ----- Flop ------ --- Messages ---
>> -- Message Lengths -- -- Reductions --
>> Avg %Total Avg %Total Count %Total
>> Avg %Total Count %Total
>> 0: Main Stage: 1.1628e+02 100.0% 1.2288e+12 100.0% 7.000e+01 100.0%
>> 7.267e+07 100.0% 6.300e+01 77.8%
>>
>> ------------------------------------------------------------------------------------------------------------------------
>> See the 'Profiling' chapter of the users' manual for details on interpreting
>> output.
>> Phase summary info:
>> Count: number of times phase was executed
>> Time and Flop: Max - maximum over all processors
>> Ratio - ratio of maximum to minimum over all processors
>> Mess: number of messages sent
>> AvgLen: average message length (bytes)
>> Reduct: number of global reductions
>> Global: entire computation
>> Stage: stages of a computation. Set stages with PetscLogStagePush() and
>> PetscLogStagePop().
>> %T - percent time in this phase %F - percent flop in this phase
>> %M - percent messages in this phase %L - percent message lengths
>> in this phase
>> %R - percent reductions in this phase
>> Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over
>> all processors)
>> GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU
>> time over all processors)
>> CpuToGpu Count: total number of CPU to GPU copies per processor
>> CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per
>> processor)
>> GpuToCpu Count: total number of GPU to CPU copies per processor
>> GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per
>> processor)
>> GPU %F: percent flops on GPU in this event
>> ------------------------------------------------------------------------------------------------------------------------
>> Event Count Time (sec) Flop
>> --- Global --- --- Stage ---- Total GPU - CpuToGpu - - GpuToCpu
>> - GPU
>> Max Ratio Max Ratio Max Ratio Mess AvgLen
>> Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s Mflop/s Count Size Count
>> Size %F
>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>
>> --- Event Stage 0: Main Stage
>>
>> BuildTwoSided 2 1.0 4.4400e-01567.5 0.00e+00 0.0 2.0e+00 4.0e+00
>> 2.0e+00 0 0 3 0 2 0 0 3 0 3 0 0 0 0.00e+00 0
>> 0.00e+00 0
>> BuildTwoSidedF 1 1.0 4.4395e-0115659.1 0.00e+00 0.0 0.0e+00 0.0e+00
>> 1.0e+00 0 0 0 0 1 0 0 0 0 2 0 0 0 0.00e+00 0
>> 0.00e+00 0
>> MatAssemblyBegin 32 1.0 4.4400e-017378.9 0.00e+00 0.0 0.0e+00 0.0e+00
>> 1.0e+00 0 0 0 0 1 0 0 0 0 2 0 0 0 0.00e+00 0
>> 0.00e+00 0
>> MatAssemblyEnd 32 1.0 1.8511e+00 2.2 0.00e+00 0.0 0.0e+00 0.0e+00
>> 6.0e+00 1 0 0 0 7 1 0 0 0 10 0 0 0 0.00e+00 0
>> 0.00e+00 0
>> MatZeroEntries 1 1.0 3.3306e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0
>> 0.00e+00 0
>> MatLoad 1 1.0 1.7220e+01 1.0 0.00e+00 0.0 6.0e+00 -8.8e+07
>> 2.1e+01 15 0 9-10 26 15 0 9-10 33 0 0 0 0.00e+00 0
>> 0.00e+00 0
>> MatMatMultSym 60 1.0 9.2215e-01 2.6 0.00e+00 0.0 4.0e+00 7.3e+05
>> 3.2e+01 1 0 6 0 40 1 0 6 0 51 0 0 0 0.00e+00 0
>> 0.00e+00 0
>> MatMatMultNum 30 1.0 4.2967e+01 1.0 6.34e+11 1.1 6.0e+01 9.4e+07
>> 0.0e+00 37100 86110 0 37100 86110 0 28598 920026 2 6.71e+03 30
>> 8.73e+04 98
>> MatCUSPARSCopyTo 1 1.0 4.4761e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 1 3.80e+03 0
>> 0.00e+00 0
>> MatDenseCopyTo 1 1.0 2.2742e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 1 2.91e+03 0
>> 0.00e+00 0
>> MatDenseCopyFrom 31 1.0 1.2006e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00 10 0 0 0 0 10 0 0 0 0 0 0 0 0.00e+00 31
>> 9.02e+04 0
>> VecSet 3 1.0 4.1917e-04 1.1 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0
>> 0.00e+00 0
>> SFSetGraph 1 1.0 1.9180e-04 1.1 0.00e+00 0.0 0.0e+00 0.0e+00
>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0
>> 0.00e+00 0
>> SFSetUp 1 1.0 1.3672e-02 1.1 0.00e+00 0.0 4.0e+00 7.3e+05
>> 1.0e+00 0 0 6 0 1 0 0 6 0 2 0 0 0 0.00e+00 0
>> 0.00e+00 0
>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>
>> Memory usage is given in bytes:
>>
>> Object Type Creations Destructions Memory Descendants' Mem.
>> Reports information only for process 0.
>>
>> --- Event Stage 0: Main Stage
>>
>> Matrix 37 30 2867511840 0.
>> Viewer 2 0 0 0.
>> Vector 4 1 1792 0.
>> Index Set 2 2 1495248 0.
>> Star Forest Graph 3 0 0 0.
>> ========================================================================================================================
>> Average time to get PetscTime(): 3.83e-08
>> Average time for MPI_Barrier(): 7.874e-07
>> Average time for zero size MPI_Send(): 3.4035e-06
>> #PETSc Option Table entries:
>> -bench spmm
>> -enable_gpu
>> -log_view
>> -mat_type aijcusparse
>> -matload_block_size 1
>> -matrix /p/gpfs1/yadav2/tensors/petsc/arabic-2005.petsc
>> -n 20
>> -vec_type cuda
>> -warmup 10
>> ```
>>
>> Thanks,
>>
>> Rohan Yadav
>>
>
>
>
> --
> Stefano