I'm sorry, I made a switch here: the original log view I sent for the 2 runs was on a different input matrix. At Barry's request I switched to a different matrix, since the original one did not fit on 1 GPU.
> In the previously sent runs it was about 98% on GPU.

Re the 98% on the GPU, though: my first email already showed a similar ratio in its log:

```
MatMatMultNum       30 1.0 4.2967e+01 1.0 6.34e+11 1.1 6.0e+01 9.4e+07 0.0e+00 37 100 86 110 0 37 100 86 110 0 28598 920026 2 6.71e+03 30 8.73e+04 98
```

The follow-up log might be slightly different as well, because I pushed a new log stage as requested by Stefano.

Rohan

On Thu, Feb 3, 2022 at 11:50 AM Barry Smith <[email protected]> wrote:

>
>   Mark,
>
>   Good eye. Something is definitely very different between this run and
> the previous (options, code change?). In the previously sent runs it was
> about 98% on GPU.
>
>   Barry
>
> On Feb 3, 2022, at 12:29 PM, Rohan Yadav <[email protected]> wrote:
>
> > Please send the code that builds the sparse B matrix and the
> > setMatToConstant() routine.
>
> Setting to a constant:
>
> ```
> void setMatToConstant(Mat mat, PetscScalar c) {
>   PetscInt rStart, rEnd, m, n;
>   MatGetSize(mat, &m, &n);
>   MatGetOwnershipRange(mat, &rStart, &rEnd);
>   for (int i = rStart; i < rEnd; i++) {
>     for (int j = 0; j < n; j++) {
>       MatSetValue(mat, i, j, c, INSERT_VALUES);
>     }
>   }
>   MatAssemblyBegin(mat, MAT_FINAL_ASSEMBLY);
>   MatAssemblyEnd(mat, MAT_FINAL_ASSEMBLY);
> }
> ```
>
> Loading the sparse matrix from disk:
>
> ```
> int loadMatrixFromFile(Mat* A, char* filename) {
>   auto ierr = MatCreate(PETSC_COMM_WORLD, A); CHKERRQ(ierr);
>   MatSetFromOptions(*A);
>   PetscViewer viewer;
>   PetscViewerCreate(PETSC_COMM_WORLD, &viewer);
>   PetscViewerSetType(viewer, PETSCVIEWERBINARY);
>   PetscViewerFileSetMode(viewer, FILE_MODE_READ);
>   PetscViewerFileSetName(viewer, filename);
>   MatLoad(*A, viewer);
>   return 0;
> }
> ```
>
> These are only called once and should not affect the computation in the
> loop, though.
>
> > But first please verify that if you run with one MPI rank the "on GPU"
> > and the overall flop rates for the MatMatMult() are almost the same and
> > there is no copy from the GPU for each multiply?
>
> Yes, with 1 MPI rank / GPU there are no extra copies done. As soon as I
> move to 2 ranks I see this behavior.
>
> Here are updated logs with a new stage for 2 ranks. I've staged the logs
> into "MyComputation".
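For context, the staging is done roughly as in the sketch below. This is illustrative, not the exact benchmark code: the function name, variable names, and loop structure are assumptions, with `warmup` and `n` meant to correspond to the `-warmup 10` and `-n 20` options that appear in the log.

```
#include <petscmat.h>

/* Sketch only: run `warmup` multiplies outside the stage (they are logged
 * under "Main Stage"), then run `n` multiplies inside a "MyComputation"
 * stage so that -log_view reports them separately. */
static PetscErrorCode BenchmarkSpMM(Mat A, Mat B, PetscInt warmup, PetscInt n)
{
  PetscLogStage  stage;
  Mat            C = NULL;
  PetscErrorCode ierr;

  ierr = PetscLogStageRegister("MyComputation", &stage);CHKERRQ(ierr);

  /* Warmup multiplies are not inside the pushed stage. */
  for (PetscInt i = 0; i < warmup; i++) {
    ierr = MatMatMult(A, B, i == 0 ? MAT_INITIAL_MATRIX : MAT_REUSE_MATRIX, PETSC_DEFAULT, &C);CHKERRQ(ierr);
  }

  /* Everything between push and pop shows up under "MyComputation". */
  ierr = PetscLogStagePush(stage);CHKERRQ(ierr);
  for (PetscInt i = 0; i < n; i++) {
    ierr = MatMatMult(A, B, MAT_REUSE_MATRIX, PETSC_DEFAULT, &C);CHKERRQ(ierr);
  }
  ierr = PetscLogStagePop();CHKERRQ(ierr);

  ierr = MatDestroy(&C);CHKERRQ(ierr);
  return 0;
}
```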
>
> ```
> ---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
>
> /g/g15/yadav2/taco/petsc/bin/benchmark on a named lassen572 with 2 processors, by yadav2 Thu Feb 3 09:27:30 2022
> Using Petsc Release Version 3.16.3, unknown
>
>                          Max       Max/Min     Avg       Total
> Time (sec):           2.091e+02     1.001   2.090e+02
> Objects:              4.800e+01     1.000   4.800e+01
> Flop:                 4.344e+11     1.019   4.303e+11  8.606e+11
> Flop/sec:             2.077e+09     1.018   2.059e+09  4.118e+09
> MPI Messages:         3.500e+01     1.000   3.500e+01  7.000e+01
> MPI Message Lengths:  6.316e+10     1.000   1.805e+09  1.263e+11
> MPI Reductions:       8.100e+01     1.000
>
> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>                             e.g., VecAXPY() for real vectors of length N --> 2N flop
>                             and VecAXPY() for complex vectors of length N --> 8N flop
>
> Summary of Stages:   ----- Time ------  ----- Flop ------   --- Messages ---   -- Message Lengths --   -- Reductions --
>                         Avg     %Total     Avg     %Total    Count   %Total     Avg        %Total     Count    %Total
>  0:      Main Stage: 1.0555e+02  50.5%  2.8686e+11  33.3%  3.000e+01  42.9%  1.466e+09   34.8%  4.300e+01  53.1%
>  1:   MyComputation: 1.0345e+02  49.5%  5.7373e+11  66.7%  4.000e+01  57.1%  2.058e+09   65.2%  2.000e+01  24.7%
>
> ------------------------------------------------------------------------------------------------------------------------
> See the 'Profiling' chapter of the users' manual for details on interpreting output.
> Phase summary info:
>    Count: number of times phase was executed
>    Time and Flop: Max - maximum over all processors
>                   Ratio - ratio of maximum to minimum over all processors
>    Mess: number of messages sent
>    AvgLen: average message length (bytes)
>    Reduct: number of global reductions
>    Global: entire computation
>    Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>       %T - percent time in this phase         %F - percent flop in this phase
>       %M - percent messages in this phase     %L - percent message lengths in this phase
>       %R - percent reductions in this phase
>    Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
>    GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU time over all processors)
>    CpuToGpu Count: total number of CPU to GPU copies per processor
>    CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per processor)
>    GpuToCpu Count: total number of GPU to CPU copies per processor
>    GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per processor)
>    GPU %F: percent flops on GPU in this event
> ------------------------------------------------------------------------------------------------------------------------
> Event                Count      Time (sec)     Flop                             --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu -  GPU
>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> --- Event Stage 0: Main Stage
>
> BuildTwoSided        2 1.0 4.0085e-01 36.3 0.00e+00 0.0 2.0e+00 4.0e+00 2.0e+00 0 0 3 0 2 0 0 7 0 5 0 0 0 0.00e+00 0 0.00e+00 0
> BuildTwoSidedF       1 1.0 4.0080e-01 13602.0 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00 0 0 0 0 1 0 0 0 0 2 0 0 0 0.00e+00 0 0.00e+00 0
> MatAssemblyBegin    12 1.0 4.0084e-01 7217.1 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00 0 0 0 0 1 0 0 0 0 2 0 0 0 0.00e+00 0 0.00e+00 0
> MatAssemblyEnd      12 1.0 3.4970e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00 2 0 0 0 7 3 0 0 0 14 0 0 0 0.00e+00 0 0.00e+00 0
> MatZeroEntries       1 1.0 2.4093e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> MatLoad              1 1.0 1.3756e+01 1.0 0.00e+00 0.0 6.0e+00 4.6e+08 2.1e+01 7 0 9 2 26 13 0 20 6 49 0 0 0 0.00e+00 0 0.00e+00 0
> MatMatMultSym       20 1.0 4.7919e+00 2.4 0.00e+00 0.0 4.0e+00 1.6e+07 1.2e+01 2 0 6 0 15 3 0 13 0 28 0 0 0 0.00e+00 0 0.00e+00 0
> MatMatMultNum       10 1.0 4.9853e+01 1.1 1.45e+11 1.0 2.0e+01 2.1e+09 0.0e+00 23 33 29 33 0 46 100 67 94 0 5754 182686 2 2.23e+03 10 2.08e+04 5
> MatCUSPARSCopyTo     1 1.0 2.2646e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 1 1.55e+02 0 0.00e+00 0
> MatDenseCopyTo       1 1.0 1.6636e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 1 2.08e+03 0 0.00e+00 0
> MatDenseCopyFrom    11 1.0 3.0463e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 3 0 0 0 0 0 0 0 0.00e+00 11 2.29e+04 0
> VecSet               3 1.0 5.0035e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> SFSetGraph           1 1.0 4.4294e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> SFSetUp              1 1.0 1.3982e-01 1.0 0.00e+00 0.0 4.0e+00 1.6e+07 1.0e+00 0 0 6 0 1 0 0 13 0 2 0 0 0 0.00e+00 0 0.00e+00 0
>
> --- Event Stage 1: MyComputation
>
> MatAssemblyBegin    20 1.0 1.6894e-05 2.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> MatAssemblyEnd      20 1.0 1.5575e-05 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> MatMatMultSym       40 1.0 1.0096e+01 2.6 0.00e+00 0.0 0.0e+00 0.0e+00 2.0e+01 3 0 0 0 25 7 0 0 0 100 0 0 0 0.00e+00 0 0.00e+00 0
> MatMatMultNum       20 1.0 9.9320e+01 1.1 2.90e+11 1.0 4.0e+01 2.1e+09 0.0e+00 46 67 57 65 0 93 100 100 100 0 5777 182577 0 0.00e+00 20 4.16e+04 5
> MatDenseCopyFrom    20 1.0 5.5380e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 3 0 0 0 0 5 0 0 0 0 0 0 0 0.00e+00 20 4.16e+04 0
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Memory usage is given in bytes:
>
> Object Type          Creations   Destructions     Memory  Descendants' Mem.
> Reports information only for process 0.
>
> --- Event Stage 0: Main Stage
>
>               Matrix    17             10  20381695840     0.
>               Viewer     2              0            0     0.
>               Vector     4              1         1792     0.
>            Index Set     2              2     31848152     0.
>    Star Forest Graph     3              0            0     0.
>
> --- Event Stage 1: MyComputation
>
>               Matrix    20             20  40763391680     0.
>
> ========================================================================================================================
> Average time to get PetscTime(): 3.96e-08
> Average time for MPI_Barrier(): 8.184e-07
> Average time for zero size MPI_Send(): 2.8165e-06
> #PETSc Option Table entries:
> -bench spmm
> -enable_gpu
> -log_view
> -mat_type aijcusparse
> -matload_block_size 1
> -matrix /p/gpfs1/yadav2/tensors/petsc/nlpkkt200.petsc
> -n 20
> -vec_type cuda
> -warmup 10
> #End of PETSc Option Table entries
> Compiled without FORTRAN kernels
> Compiled with full precision matrices (default)
> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
> Configure options: --download-c2html=0 --download-hwloc=0 --download-sowing=0 --prefix=./petsc-install/ --with-64-bit-indices=0 --with-blaslapack-lib="/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib/liblapack.so /usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib/libblas.so" --with-cc=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc --with-clanguage=C --with-cxx-dialect=C++17 --with-cxx=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpig++ --with-cuda=1 --with-debugging=0 --with-fc=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran --with-fftw=0 --with-hdf5-dir=/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4 --with-hdf5=1 --with-mumps=0 --with-precision=double --with-scalapack=0 --with-scalar-type=real --with-shared-libraries=1 --with-ssl=0 --with-suitesparse=0 --with-trilinos=0 --with-valgrind=0 --with-x=0 --with-zlib-include=/usr/include --with-zlib-lib=/usr/lib64/libz.so --with-zlib=1 CFLAGS="-g -DNoChange" COPTFLAGS="-O3" CXXFLAGS="-O3" CXXOPTFLAGS="-O3" FFLAGS=-g CUDAFLAGS=-std=c++17 FOPTFLAGS= PETSC_ARCH=arch-linux-c-opt
> -----------------------------------------
> Libraries compiled on 2022-01-21 06:41:50 on lassen111
> Machine characteristics: Linux-4.14.0-115.21.2.1chaos.ch6a.ppc64le-ppc64le-with-redhat-7.6-Maipo
> Using PETSc directory: /g/g15/yadav2/taco/petsc/petsc/petsc-install
> Using PETSc arch:
> -----------------------------------------
>
> Using C compiler: /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc -g -DNoChange -fPIC "-O3"
> Using Fortran compiler: /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran -g -fPIC
> -----------------------------------------
>
> Using include paths: -I/g/g15/yadav2/taco/petsc/petsc/petsc-install/include -I/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/include -I/usr/include -I/usr/tce/packages/cuda/cuda-11.1.0/include
> -----------------------------------------
>
> Using C linker: /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc
> Using Fortran linker: /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran
> Using libraries: -Wl,-rpath,/g/g15/yadav2/taco/petsc/petsc/petsc-install/lib -L/g/g15/yadav2/taco/petsc/petsc/petsc-install/lib -lpetsc -Wl,-rpath,/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib -L/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib -Wl,-rpath,/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/lib -L/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/lib -Wl,-rpath,/usr/tce/packages/cuda/cuda-11.1.0/lib64 -L/usr/tce/packages/cuda/cuda-11.1.0/lib64 -Wl,-rpath,/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib -L/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc/ppc64le-redhat-linux/8 -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc/ppc64le-redhat-linux/8 -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib64 -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib64 -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib -llapack -lblas -lhdf5_hl -lhdf5 -lm /usr/lib64/libz.so -lcuda -lcudart -lcufft -lcublas -lcusparse -lcusolver -lcurand -lstdc++ -ldl -lmpiprofilesupport -lmpi_ibm_usempi -lmpi_ibm_mpifh -lmpi_ibm -lgfortran -lm -lgfortran -lm -lgcc_s -lquadmath -lpthread -lquadmath -lstdc++ -ldl
> -----------------------------------------
> ```
>
> On Wed, Feb 2, 2022 at 11:59 PM Stefano Zampini <[email protected]> wrote:
>
>>
>>> 1) It uses MatMPIDenseScatter() to move to the other ranks their needed
>>> rows of the C matrix. That function has the call MatDenseGetArrayRead(),
>>> which normally would trigger a copy of C up to the CPU each time. But
>>> since C is not changing in your test run I guess it only triggers one
>>> copy.
>>>
>>> 2) It uses
>>> MatMatMultNumericAdd_SeqAIJ_SeqDense(aij->B,workB,cdense->A,PETSC_TRUE);CHKERRQ(ierr);
>>> to do the off-diagonal part of the product, but this triggers, for each
>>> multiply, a copy of the result matrix from the CPU to the GPU (hugely
>>> expensive).
>>>
>>> For performance there needs to be a new routine
>>> MatMatMultNumeric_MPIAIJCUSPARSE_MPICUDADense() that is smarter about the
>>> needed MPI communication, so it only moves exactly what it needs to the
>>> other ranks and does the off-diagonal part of the product on the GPU so
>>> it does not need to copy the result up to the CPU.
>>>
>>
>> MPIAIJCUSPARSE uses MatProductSetFromOptions_MPIAIJBACKEND.
>>
>> Rohan,
>> I would suggest adding a PetscLogStage around your performance loop (do a
>> warmup outside of it) and sending the relevant portion of the log.
>>
>>>   Barry
>>>
>>> ---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
>>>
>>> /g/g15/yadav2/taco/petsc/bin/benchmark on a named lassen457 with 2 processors, by yadav2 Wed Feb 2 17:23:19 2022
>>> Using Petsc Release Version 3.16.3, unknown
>>>
>>>                          Max       Max/Min     Avg       Total
>>> Time (sec):           1.163e+02     1.000   1.163e+02
>>> Objects:              4.800e+01     1.000   4.800e+01
>>> Flop:                 6.338e+11     1.065   6.144e+11  1.229e+12
>>> Flop/sec:             5.451e+09     1.065   5.284e+09  1.057e+10
>>> MPI Messages:         3.500e+01     1.000   3.500e+01  7.000e+01
>>> MPI Message Lengths:  2.544e+09     1.000   7.267e+07  5.087e+09
>>> MPI Reductions:       8.100e+01     1.000
>>>
>>> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>>>                             e.g., VecAXPY() for real vectors of length N --> 2N flop
>>>                             and VecAXPY() for complex vectors of length N --> 8N flop
>>>
>>> Summary of Stages:   ----- Time ------  ----- Flop ------   --- Messages ---   -- Message Lengths --   -- Reductions --
>>>                         Avg     %Total     Avg     %Total    Count   %Total     Avg        %Total     Count    %Total
>>>  0:      Main Stage: 1.1628e+02 100.0%  1.2288e+12 100.0%  7.000e+01 100.0%  7.267e+07  100.0%  6.300e+01  77.8%
>>>
>>> ------------------------------------------------------------------------------------------------------------------------
>>> See the 'Profiling' chapter of the users' manual for details on interpreting output.
>>> Phase summary info:
>>>    Count: number of times phase was executed
>>>    Time and Flop: Max - maximum over all processors
>>>                   Ratio - ratio of maximum to minimum over all processors
>>>    Mess: number of messages sent
>>>    AvgLen: average message length (bytes)
>>>    Reduct: number of global reductions
>>>    Global: entire computation
>>>    Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>>>       %T - percent time in this phase         %F - percent flop in this phase
>>>       %M - percent messages in this phase     %L - percent message lengths in this phase
>>>       %R - percent reductions in this phase
>>>    Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
>>>    GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU time over all processors)
>>>    CpuToGpu Count: total number of CPU to GPU copies per processor
>>>    CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per processor)
>>>    GpuToCpu Count: total number of GPU to CPU copies per processor
>>>    GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per processor)
>>>    GPU %F: percent flops on GPU in this event
>>> ------------------------------------------------------------------------------------------------------------------------
>>> Event                Count      Time (sec)     Flop                             --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu -  GPU
>>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>
>>> --- Event Stage 0: Main Stage
>>>
>>> BuildTwoSided        2 1.0 4.4400e-01 567.5 0.00e+00 0.0 2.0e+00 4.0e+00 2.0e+00 0 0 3 0 2 0 0 3 0 3 0 0 0 0.00e+00 0 0.00e+00 0
>>> BuildTwoSidedF       1 1.0 4.4395e-01 15659.1 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00 0 0 0 0 1 0 0 0 0 2 0 0 0 0.00e+00 0 0.00e+00 0
>>> MatAssemblyBegin    32 1.0 4.4400e-01 7378.9 0.00e+00 0.0 0.0e+00 0.0e+00 1.0e+00 0 0 0 0 1 0 0 0 0 2 0 0 0 0.00e+00 0 0.00e+00 0
>>> MatAssemblyEnd      32 1.0 1.8511e+00 2.2 0.00e+00 0.0 0.0e+00 0.0e+00 6.0e+00 1 0 0 0 7 1 0 0 0 10 0 0 0 0.00e+00 0 0.00e+00 0
>>> MatZeroEntries       1 1.0 3.3306e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
>>> MatLoad              1 1.0 1.7220e+01 1.0 0.00e+00 0.0 6.0e+00 -8.8e+07 2.1e+01 15 0 9 -10 26 15 0 9 -10 33 0 0 0 0.00e+00 0 0.00e+00 0
>>> MatMatMultSym       60 1.0 9.2215e-01 2.6 0.00e+00 0.0 4.0e+00 7.3e+05 3.2e+01 1 0 6 0 40 1 0 6 0 51 0 0 0 0.00e+00 0 0.00e+00 0
>>> MatMatMultNum       30 1.0 4.2967e+01 1.0 6.34e+11 1.1 6.0e+01 9.4e+07 0.0e+00 37 100 86 110 0 37 100 86 110 0 28598 920026 2 6.71e+03 30 8.73e+04 98
>>> MatCUSPARSCopyTo     1 1.0 4.4761e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 1 3.80e+03 0 0.00e+00 0
>>> MatDenseCopyTo       1 1.0 2.2742e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 1 2.91e+03 0 0.00e+00 0
>>> MatDenseCopyFrom    31 1.0 1.2006e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 10 0 0 0 0 10 0 0 0 0 0 0 0 0.00e+00 31 9.02e+04 0
>>> VecSet               3 1.0 4.1917e-04 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
>>> SFSetGraph           1 1.0 1.9180e-04 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
>>> SFSetUp              1 1.0 1.3672e-02 1.1 0.00e+00 0.0 4.0e+00 7.3e+05 1.0e+00 0 0 6 0 1 0 0 6 0 2 0 0 0 0.00e+00 0 0.00e+00 0
>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>
>>> Memory usage is given in bytes:
>>>
>>> Object Type          Creations   Destructions     Memory  Descendants' Mem.
>>> Reports information only for process 0.
>>>
>>> --- Event Stage 0: Main Stage
>>>
>>>               Matrix    37             30   2867511840     0.
>>>               Viewer     2              0            0     0.
>>>               Vector     4              1         1792     0.
>>>            Index Set     2              2      1495248     0.
>>>    Star Forest Graph     3              0            0     0.
>>> ========================================================================================================================
>>> Average time to get PetscTime(): 3.83e-08
>>> Average time for MPI_Barrier(): 7.874e-07
>>> Average time for zero size MPI_Send(): 3.4035e-06
>>> #PETSc Option Table entries:
>>> -bench spmm
>>> -enable_gpu
>>> -log_view
>>> -mat_type aijcusparse
>>> -matload_block_size 1
>>> -matrix /p/gpfs1/yadav2/tensors/petsc/arabic-2005.petsc
>>> -n 20
>>> -vec_type cuda
>>> -warmup 10
>>> ```
>>>
>>> Thanks,
>>>
>>> Rohan Yadav
>>>
>>
>> --
>> Stefano
>
>
