I suspect the new matrix has a very different parallel nonzero structure that results in MOST of the calculations taking place on the CPU (since the "off-diagonal" part of the matrix dominates the non-zero pattern). PETSc is not designed for this type of nonzero structure and will give a bad performance (CPU or GPU); it is not a "PDE-ish" type of nonzero structure.
> On Feb 3, 2022, at 2:59 PM, Rohan Yadav <roh...@alumni.cmu.edu> wrote: > > I'm sorry, I did a little switch here. The original log view I sent for 2 > runs was on a different input matrix. Based on Barry's request I switched to > a different matrix as the original one did not fit on 1 GPU. > > > In the previously sent runs it was about 98% on GPU. > > Re 98% on the GPU though, my first email had a similar ratio in the log > though: > ``` > MatMatMultNum 30 1.0 4.2967e+01 1.0 6.34e+11 1.1 6.0e+01 9.4e+07 > 0.0e+00 37100 86110 0 37100 86110 0 28598 920026 2 6.71e+03 30 > 8.73e+04 98 > ``` > The follow up log might be slightly different as well because I pushed a new > log stage as requested by Stefano. > > Rohan > > On Thu, Feb 3, 2022 at 11:50 AM Barry Smith <bsm...@petsc.dev > <mailto:bsm...@petsc.dev>> wrote: > > Mark, > > Good eye. Something is definitely very different between this run and the > previous (options, code change?). In the previously sent runs it was about > 98% on GPU. > > Barry > > >> On Feb 3, 2022, at 12:29 PM, Rohan Yadav <roh...@alumni.cmu.edu >> <mailto:roh...@alumni.cmu.edu>> wrote: >> >> > Please send the code that builds the sparse B matrix and the >> > setMatToConstant() routine. >> >> Setting to a constant: >> ``` >> void setMatToConstant(Mat mat, PetscScalar c) { >> PetscInt rStart, rEnd, m, n; >> MatGetSize(mat, &m, &n); >> MatGetOwnershipRange(mat, &rStart, &rEnd); >> for (int i = rStart; i < rEnd; i++) { >> for (int j = 0; j < n; j++) { >> MatSetValue(mat, i, j, c, INSERT_VALUES); >> } >> } >> MatAssemblyBegin(mat, MAT_FINAL_ASSEMBLY); >> MatAssemblyEnd(mat, MAT_FINAL_ASSEMBLY); >> } >> ``` >> >> Loading sparse matrix from disk: >> ``` >> int loadMatrixFromFile(Mat* A, char* filename) { >> auto ierr = MatCreate(PETSC_COMM_WORLD, A); CHKERRQ(ierr); >> MatSetFromOptions(*A); >> PetscViewer viewer; >> PetscViewerCreate(PETSC_COMM_WORLD, &viewer); >> PetscViewerSetType(viewer, PETSCVIEWERBINARY); >> PetscViewerFileSetMode(viewer, FILE_MODE_READ); >> PetscViewerFileSetName(viewer, filename); >> MatLoad(*A, viewer); >> return 0; >> } >> ``` >> These are only called once and should not affect the computation in a loop >> though. >> > But first please verify that if you run with one MPI rank the "on GPU" and >> > the overall flop rates for the MatMatMult() are almost the same and there >> > is no copy from the GPU for each multiply? >> >> Yes, with 1 mpi rank / GPU there are no extra copies done. As soon as I move >> to 2 ranks I see this behavior. >> >> Here are updated logs with a new stage for 2 ranks. I've staged the logs >> into "MyComputation". >> >> ``` >> ---------------------------------------------- PETSc Performance Summary: >> ---------------------------------------------- >> >> /g/g15/yadav2/taco/petsc/bin/benchmark on a named lassen572 with 2 >> processors, by yadav2 Thu Feb 3 09:27:30 2022 >> Using Petsc Release Version 3.16.3, unknown >> >> Max Max/Min Avg Total >> Time (sec): 2.091e+02 1.001 2.090e+02 >> Objects: 4.800e+01 1.000 4.800e+01 >> Flop: 4.344e+11 1.019 4.303e+11 8.606e+11 >> Flop/sec: 2.077e+09 1.018 2.059e+09 4.118e+09 >> MPI Messages: 3.500e+01 1.000 3.500e+01 7.000e+01 >> MPI Message Lengths: 6.316e+10 1.000 1.805e+09 1.263e+11 >> MPI Reductions: 8.100e+01 1.000 >> >> Flop counting convention: 1 flop = 1 real number operation of type >> (multiply/divide/add/subtract) >> e.g., VecAXPY() for real vectors of length N --> >> 2N flop >> and VecAXPY() for complex vectors of length N >> --> 8N flop >> >> Summary of Stages: ----- Time ------ ----- Flop ------ --- Messages --- >> -- Message Lengths -- -- Reductions -- >> Avg %Total Avg %Total Count %Total >> Avg %Total Count %Total >> 0: Main Stage: 1.0555e+02 50.5% 2.8686e+11 33.3% 3.000e+01 42.9% >> 1.466e+09 34.8% 4.300e+01 53.1% >> 1: MyComputation: 1.0345e+02 49.5% 5.7373e+11 66.7% 4.000e+01 57.1% >> 2.058e+09 65.2% 2.000e+01 24.7% >> >> ------------------------------------------------------------------------------------------------------------------------ >> See the 'Profiling' chapter of the users' manual for details on interpreting >> output. >> Phase summary info: >> Count: number of times phase was executed >> Time and Flop: Max - maximum over all processors >> Ratio - ratio of maximum to minimum over all processors >> Mess: number of messages sent >> AvgLen: average message length (bytes) >> Reduct: number of global reductions >> Global: entire computation >> Stage: stages of a computation. Set stages with PetscLogStagePush() and >> PetscLogStagePop(). >> %T - percent time in this phase %F - percent flop in this phase >> %M - percent messages in this phase %L - percent message lengths >> in this phase >> %R - percent reductions in this phase >> Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over >> all processors) >> GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU >> time over all processors) >> CpuToGpu Count: total number of CPU to GPU copies per processor >> CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per >> processor) >> GpuToCpu Count: total number of GPU to CPU copies per processor >> GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per >> processor) >> GPU %F: percent flops on GPU in this event >> ------------------------------------------------------------------------------------------------------------------------ >> Event Count Time (sec) Flop >> --- Global --- --- Stage ---- Total GPU - CpuToGpu - - GpuToCpu >> - GPU >> Max Ratio Max Ratio Max Ratio Mess AvgLen >> Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s Mflop/s Count Size Count >> Size %F >> --------------------------------------------------------------------------------------------------------------------------------------------------------------- >> >> --- Event Stage 0: Main Stage >> >> BuildTwoSided 2 1.0 4.0085e-0136.3 0.00e+00 0.0 2.0e+00 4.0e+00 >> 2.0e+00 0 0 3 0 2 0 0 7 0 5 0 0 0 0.00e+00 0 >> 0.00e+00 0 >> BuildTwoSidedF 1 1.0 4.0080e-0113602.0 0.00e+00 0.0 0.0e+00 0.0e+00 >> 1.0e+00 0 0 0 0 1 0 0 0 0 2 0 0 0 0.00e+00 0 >> 0.00e+00 0 >> MatAssemblyBegin 12 1.0 4.0084e-017217.1 0.00e+00 0.0 0.0e+00 0.0e+00 >> 1.0e+00 0 0 0 0 1 0 0 0 0 2 0 0 0 0.00e+00 0 >> 0.00e+00 0 >> MatAssemblyEnd 12 1.0 3.4970e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >> 6.0e+00 2 0 0 0 7 3 0 0 0 14 0 0 0 0.00e+00 0 >> 0.00e+00 0 >> MatZeroEntries 1 1.0 2.4093e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 >> 0.00e+00 0 >> MatLoad 1 1.0 1.3756e+01 1.0 0.00e+00 0.0 6.0e+00 4.6e+08 >> 2.1e+01 7 0 9 2 26 13 0 20 6 49 0 0 0 0.00e+00 0 >> 0.00e+00 0 >> MatMatMultSym 20 1.0 4.7919e+00 2.4 0.00e+00 0.0 4.0e+00 1.6e+07 >> 1.2e+01 2 0 6 0 15 3 0 13 0 28 0 0 0 0.00e+00 0 >> 0.00e+00 0 >> MatMatMultNum 10 1.0 4.9853e+01 1.1 1.45e+11 1.0 2.0e+01 2.1e+09 >> 0.0e+00 23 33 29 33 0 46100 67 94 0 5754 182686 2 2.23e+03 10 >> 2.08e+04 5 >> MatCUSPARSCopyTo 1 1.0 2.2646e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 1 1.55e+02 0 >> 0.00e+00 0 >> MatDenseCopyTo 1 1.0 1.6636e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 1 2.08e+03 0 >> 0.00e+00 0 >> MatDenseCopyFrom 11 1.0 3.0463e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 1 0 0 0 0 3 0 0 0 0 0 0 0 0.00e+00 11 >> 2.29e+04 0 >> VecSet 3 1.0 5.0035e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 >> 0.00e+00 0 >> SFSetGraph 1 1.0 4.4294e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 >> 0.00e+00 0 >> SFSetUp 1 1.0 1.3982e-01 1.0 0.00e+00 0.0 4.0e+00 1.6e+07 >> 1.0e+00 0 0 6 0 1 0 0 13 0 2 0 0 0 0.00e+00 0 >> 0.00e+00 0 >> >> --- Event Stage 1: MyComputation >> >> MatAssemblyBegin 20 1.0 1.6894e-05 2.7 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 >> 0.00e+00 0 >> MatAssemblyEnd 20 1.0 1.5575e-05 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 >> 0.00e+00 0 >> MatMatMultSym 40 1.0 1.0096e+01 2.6 0.00e+00 0.0 0.0e+00 0.0e+00 >> 2.0e+01 3 0 0 0 25 7 0 0 0100 0 0 0 0.00e+00 0 >> 0.00e+00 0 >> MatMatMultNum 20 1.0 9.9320e+01 1.1 2.90e+11 1.0 4.0e+01 2.1e+09 >> 0.0e+00 46 67 57 65 0 93100100100 0 5777 182577 0 0.00e+00 20 >> 4.16e+04 5 >> MatDenseCopyFrom 20 1.0 5.5380e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 3 0 0 0 0 5 0 0 0 0 0 0 0 0.00e+00 20 >> 4.16e+04 0 >> --------------------------------------------------------------------------------------------------------------------------------------------------------------- >> >> Memory usage is given in bytes: >> >> Object Type Creations Destructions Memory Descendants' Mem. >> Reports information only for process 0. >> >> --- Event Stage 0: Main Stage >> >> Matrix 17 10 20381695840 0. >> Viewer 2 0 0 0. >> Vector 4 1 1792 0. >> Index Set 2 2 31848152 0. >> Star Forest Graph 3 0 0 0. >> >> --- Event Stage 1: MyComputation >> >> Matrix 20 20 40763391680 0. >> ======================================================================================================================== >> Average time to get PetscTime(): 3.96e-08 >> Average time for MPI_Barrier(): 8.184e-07 >> Average time for zero size MPI_Send(): 2.8165e-06 >> #PETSc Option Table entries: >> -bench spmm >> -enable_gpu >> -log_view >> -mat_type aijcusparse >> -matload_block_size 1 >> -matrix /p/gpfs1/yadav2/tensors/petsc/nlpkkt200.petsc >> -n 20 >> -vec_type cuda >> -warmup 10 >> #End of PETSc Option Table entries >> Compiled without FORTRAN kernels >> Compiled with full precision matrices (default) >> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 >> sizeof(PetscScalar) 8 sizeof(PetscInt) 4 >> Configure options: --download-c2html=0 --download-hwloc=0 >> --download-sowing=0 --prefix=./petsc-install/ --with-64-bit-indices=0 >> --with-blaslapack-lib="/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib/liblapack.so >> /usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib/libblas.so" >> --with-cc=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc >> --with-clanguage=C --with-cxx-dialect=C++17 >> --with-cxx=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpig++ >> --with-cuda=1 --with-debugging=0 >> --with-fc=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran >> --with-fftw=0 >> --with-hdf5-dir=/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4 >> --with-hdf5=1 --with-mumps=0 --with-precision=double --with-scalapack=0 >> --with-scalar-type=real --with-shared-libraries=1 --with-ssl=0 >> --with-suitesparse=0 --with-trilinos=0 --with-valgrind=0 --with-x=0 >> --with-zlib-include=/usr/include --with-zlib-lib=/usr/lib64/libz.so >> --with-zlib=1 CFLAGS="-g -DNoChange" COPTFLAGS="-O3" CXXFLAGS="-O3" >> CXXOPTFLAGS="-O3" FFLAGS=-g CUDAFLAGS=-std=c++17 FOPTFLAGS= >> PETSC_ARCH=arch-linux-c-opt >> ----------------------------------------- >> Libraries compiled on 2022-01-21 06:41:50 on lassen111 >> Machine characteristics: >> Linux-4.14.0-115.21.2.1chaos.ch6a.ppc64le-ppc64le-with-redhat-7.6-Maipo >> Using PETSc directory: /g/g15/yadav2/taco/petsc/petsc/petsc-install >> Using PETSc arch: >> ----------------------------------------- >> >> Using C compiler: >> /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc >> -g -DNoChange -fPIC "-O3" >> Using Fortran compiler: >> /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran >> -g -fPIC >> ----------------------------------------- >> >> Using include paths: -I/g/g15/yadav2/taco/petsc/petsc/petsc-install/include >> -I/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/include >> -I/usr/include -I/usr/tce/packages/cuda/cuda-11.1.0/include >> ----------------------------------------- >> >> Using C linker: >> /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc >> Using Fortran linker: >> /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran >> Using libraries: -Wl,-rpath,/g/g15/yadav2/taco/petsc/petsc/petsc-install/lib >> -L/g/g15/yadav2/taco/petsc/petsc/petsc-install/lib -lpetsc >> -Wl,-rpath,/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib >> -L/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib >> -Wl,-rpath,/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/lib >> >> -L/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/lib >> -Wl,-rpath,/usr/tce/packages/cuda/cuda-11.1.0/lib64 >> -L/usr/tce/packages/cuda/cuda-11.1.0/lib64 >> -Wl,-rpath,/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib >> -L/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib >> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc/ppc64le-redhat-linux/8 >> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc/ppc64le-redhat-linux/8 >> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc >> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc >> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib64 >> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib64 >> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib >> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib -llapack -lblas -lhdf5_hl >> -lhdf5 -lm /usr/lib64/libz.so -lcuda -lcudart -lcufft -lcublas -lcusparse >> -lcusolver -lcurand -lstdc++ -ldl -lmpiprofilesupport -lmpi_ibm_usempi >> -lmpi_ibm_mpifh -lmpi_ibm -lgfortran -lm -lgfortran -lm -lgcc_s -lquadmath >> -lpthread -lquadmath -lstdc++ -ldl >> ----------------------------------------- >> ``` >> >> On Wed, Feb 2, 2022 at 11:59 PM Stefano Zampini <stefano.zamp...@gmail.com >> <mailto:stefano.zamp...@gmail.com>> wrote: >> >> >> 1) It uses MatMPIDenseScatter() to move to the other ranks their needed rows >> of the C matrix. That function has the call MatDenseGetArrayRead() normally >> would trigger a copy of C up to the CPU each time. But since C is not >> changing in your test run I guess it only triggers one copy. >> >> 2) If uses >> MatMatMultNumericAdd_SeqAIJ_SeqDense(aij->B,workB,cdense->A,PETSC_TRUE);CHKERRQ(ierr); >> to do the off diagonal part of the product but this triggers for each >> multiply a copy of the result matrix from the CPU to the GPU (hugely >> expensive) >> >> For performance there needs to be a new routine >> MatMatMultNumeric_MPIAIJCUSPRSE_MPICUDADense() that is smarter about the >> needed MPI communication so it only moves exactly what it needs to the other >> ranks and it does the off-diagonal part of the product on the GPU so it does >> not need to copy the result up to the CPU. >> >> >> MPIAIJCUSPARSE uses MatProductSetFromOptions_MPIAIJBACKEND >> >> Rohan >> I would suggest to add PetscLogStage around your performance loop (do a >> warmup outside of it) and send the relevant portion of the log >> >> Barry >> >> >> >> >> >> >>> ---------------------------------------------- PETSc Performance Summary: >>> ---------------------------------------------- >>> >>> /g/g15/yadav2/taco/petsc/bin/benchmark on a named lassen457 with 2 >>> processors, by yadav2 Wed Feb 2 17:23:19 2022 >>> Using Petsc Release Version 3.16.3, unknown >>> >>> Max Max/Min Avg Total >>> Time (sec): 1.163e+02 1.000 1.163e+02 >>> Objects: 4.800e+01 1.000 4.800e+01 >>> Flop: 6.338e+11 1.065 6.144e+11 1.229e+12 >>> Flop/sec: 5.451e+09 1.065 5.284e+09 1.057e+10 >>> MPI Messages: 3.500e+01 1.000 3.500e+01 7.000e+01 >>> MPI Message Lengths: 2.544e+09 1.000 7.267e+07 5.087e+09 >>> MPI Reductions: 8.100e+01 1.000 >>> >>> Flop counting convention: 1 flop = 1 real number operation of type >>> (multiply/divide/add/subtract) >>> e.g., VecAXPY() for real vectors of length N >>> --> 2N flop >>> and VecAXPY() for complex vectors of length N >>> --> 8N flop >>> >>> Summary of Stages: ----- Time ------ ----- Flop ------ --- Messages --- >>> -- Message Lengths -- -- Reductions -- >>> Avg %Total Avg %Total Count %Total >>> Avg %Total Count %Total >>> 0: Main Stage: 1.1628e+02 100.0% 1.2288e+12 100.0% 7.000e+01 100.0% >>> 7.267e+07 100.0% 6.300e+01 77.8% >>> >>> ------------------------------------------------------------------------------------------------------------------------ >>> See the 'Profiling' chapter of the users' manual for details on >>> interpreting output. >>> Phase summary info: >>> Count: number of times phase was executed >>> Time and Flop: Max - maximum over all processors >>> Ratio - ratio of maximum to minimum over all processors >>> Mess: number of messages sent >>> AvgLen: average message length (bytes) >>> Reduct: number of global reductions >>> Global: entire computation >>> Stage: stages of a computation. Set stages with PetscLogStagePush() and >>> PetscLogStagePop(). >>> %T - percent time in this phase %F - percent flop in this >>> phase >>> %M - percent messages in this phase %L - percent message lengths >>> in this phase >>> %R - percent reductions in this phase >>> Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over >>> all processors) >>> GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU >>> time over all processors) >>> CpuToGpu Count: total number of CPU to GPU copies per processor >>> CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per >>> processor) >>> GpuToCpu Count: total number of GPU to CPU copies per processor >>> GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per >>> processor) >>> GPU %F: percent flops on GPU in this event >>> ------------------------------------------------------------------------------------------------------------------------ >>> Event Count Time (sec) Flop >>> --- Global --- --- Stage ---- Total GPU - CpuToGpu - - >>> GpuToCpu - GPU >>> Max Ratio Max Ratio Max Ratio Mess AvgLen >>> Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s Mflop/s Count Size Count >>> Size %F >>> --------------------------------------------------------------------------------------------------------------------------------------------------------------- >>> >>> --- Event Stage 0: Main Stage >>> >>> BuildTwoSided 2 1.0 4.4400e-01567.5 0.00e+00 0.0 2.0e+00 4.0e+00 >>> 2.0e+00 0 0 3 0 2 0 0 3 0 3 0 0 0 0.00e+00 0 >>> 0.00e+00 0 >>> BuildTwoSidedF 1 1.0 4.4395e-0115659.1 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 1.0e+00 0 0 0 0 1 0 0 0 0 2 0 0 0 0.00e+00 0 >>> 0.00e+00 0 >>> MatAssemblyBegin 32 1.0 4.4400e-017378.9 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 1.0e+00 0 0 0 0 1 0 0 0 0 2 0 0 0 0.00e+00 0 >>> 0.00e+00 0 >>> MatAssemblyEnd 32 1.0 1.8511e+00 2.2 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 6.0e+00 1 0 0 0 7 1 0 0 0 10 0 0 0 0.00e+00 0 >>> 0.00e+00 0 >>> MatZeroEntries 1 1.0 3.3306e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 >>> 0.00e+00 0 >>> MatLoad 1 1.0 1.7220e+01 1.0 0.00e+00 0.0 6.0e+00 -8.8e+07 >>> 2.1e+01 15 0 9-10 26 15 0 9-10 33 0 0 0 0.00e+00 0 >>> 0.00e+00 0 >>> MatMatMultSym 60 1.0 9.2215e-01 2.6 0.00e+00 0.0 4.0e+00 7.3e+05 >>> 3.2e+01 1 0 6 0 40 1 0 6 0 51 0 0 0 0.00e+00 0 >>> 0.00e+00 0 >>> MatMatMultNum 30 1.0 4.2967e+01 1.0 6.34e+11 1.1 6.0e+01 9.4e+07 >>> 0.0e+00 37100 86110 0 37100 86110 0 28598 920026 2 6.71e+03 30 >>> 8.73e+04 98 >>> MatCUSPARSCopyTo 1 1.0 4.4761e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 1 3.80e+03 0 >>> 0.00e+00 0 >>> MatDenseCopyTo 1 1.0 2.2742e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 1 2.91e+03 0 >>> 0.00e+00 0 >>> MatDenseCopyFrom 31 1.0 1.2006e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 0.0e+00 10 0 0 0 0 10 0 0 0 0 0 0 0 0.00e+00 31 >>> 9.02e+04 0 >>> VecSet 3 1.0 4.1917e-04 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 >>> 0.00e+00 0 >>> SFSetGraph 1 1.0 1.9180e-04 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 >>> 0.00e+00 0 >>> SFSetUp 1 1.0 1.3672e-02 1.1 0.00e+00 0.0 4.0e+00 7.3e+05 >>> 1.0e+00 0 0 6 0 1 0 0 6 0 2 0 0 0 0.00e+00 0 >>> 0.00e+00 0 >>> --------------------------------------------------------------------------------------------------------------------------------------------------------------- >>> >>> Memory usage is given in bytes: >>> >>> Object Type Creations Destructions Memory Descendants' Mem. >>> Reports information only for process 0. >>> >>> --- Event Stage 0: Main Stage >>> >>> Matrix 37 30 2867511840 0. >>> Viewer 2 0 0 0. >>> Vector 4 1 1792 0. >>> Index Set 2 2 1495248 0. >>> Star Forest Graph 3 0 0 0. >>> ======================================================================================================================== >>> Average time to get PetscTime(): 3.83e-08 >>> Average time for MPI_Barrier(): 7.874e-07 >>> Average time for zero size MPI_Send(): 3.4035e-06 >>> #PETSc Option Table entries: >>> -bench spmm >>> -enable_gpu >>> -log_view >>> -mat_type aijcusparse >>> -matload_block_size 1 >>> -matrix /p/gpfs1/yadav2/tensors/petsc/arabic-2005.petsc >>> -n 20 >>> -vec_type cuda >>> -warmup 10 >>> ``` >>> >>> Thanks, >>> >>> Rohan Yadav >>> >> >> >> >> -- >> Stefano >