To be concrete, the first matrix was https://sparse.tamu.edu/LAW/arabic-2005 and the second was https://sparse.tamu.edu/Schenk/nlpkkt200 (which does look like it comes from the PDE domain?).
Regardless of the non-zero structure, there is still a significant hit when moving from 1 GPU to multiple GPUs that causes a large number of device-to-host copies to be performed. If this is a result of the PETSc implementation that's fine -- but if there's something I can do to work around it, that would be great.

Rohan

On Thu, Feb 3, 2022 at 1:25 PM Barry Smith <[email protected]> wrote:

>   I suspect the new matrix has a very different parallel nonzero structure that results in MOST of the calculations taking place on the CPU (since the "off-diagonal" part of the matrix dominates the non-zero pattern). PETSc is not designed for this type of nonzero structure and will give bad performance (CPU or GPU); it is not a "PDE-ish" type of nonzero structure.
>
> On Feb 3, 2022, at 2:59 PM, Rohan Yadav <[email protected]> wrote:
>
> I'm sorry, I did a little switch here. The original log view I sent for 2 runs was on a different input matrix. Based on Barry's request I switched to a different matrix, as the original one did not fit on 1 GPU.
>
> > In the previously sent runs it was about 98% on GPU.
>
> Re 98% on the GPU though, my first email had a similar ratio in the log:
> ```
> MatMatMultNum 30 1.0 4.2967e+01 1.0 6.34e+11 1.1 6.0e+01 9.4e+07 0.0e+00 37100 86110 0 37100 86110 0 28598 920026 2 6.71e+03 30 8.73e+04 98
> ```
> The follow-up log might be slightly different as well because I pushed a new log stage as requested by Stefano.
>
> Rohan
>
> On Thu, Feb 3, 2022 at 11:50 AM Barry Smith <[email protected]> wrote:
>
>>   Mark,
>>
>>   Good eye. Something is definitely very different between this run and the previous (options, code change?). In the previously sent runs it was about 98% on GPU.
>>
>>   Barry
>>
>> On Feb 3, 2022, at 12:29 PM, Rohan Yadav <[email protected]> wrote:
>>
>> > Please send the code that builds the sparse B matrix and the setMatToConstant() routine.
>>
>> Setting to a constant:
>> ```
>> void setMatToConstant(Mat mat, PetscScalar c) {
>>   PetscInt rStart, rEnd, m, n;
>>   MatGetSize(mat, &m, &n);
>>   MatGetOwnershipRange(mat, &rStart, &rEnd);
>>   for (int i = rStart; i < rEnd; i++) {
>>     for (int j = 0; j < n; j++) {
>>       MatSetValue(mat, i, j, c, INSERT_VALUES);
>>     }
>>   }
>>   MatAssemblyBegin(mat, MAT_FINAL_ASSEMBLY);
>>   MatAssemblyEnd(mat, MAT_FINAL_ASSEMBLY);
>> }
>> ```
>>
>> Loading the sparse matrix from disk:
>> ```
>> int loadMatrixFromFile(Mat* A, char* filename) {
>>   auto ierr = MatCreate(PETSC_COMM_WORLD, A); CHKERRQ(ierr);
>>   MatSetFromOptions(*A);
>>   PetscViewer viewer;
>>   PetscViewerCreate(PETSC_COMM_WORLD, &viewer);
>>   PetscViewerSetType(viewer, PETSCVIEWERBINARY);
>>   PetscViewerFileSetMode(viewer, FILE_MODE_READ);
>>   PetscViewerFileSetName(viewer, filename);
>>   MatLoad(*A, viewer);
>>   return 0;
>> }
>> ```
>>
>> These are only called once and should not affect the computation in a loop though.
>>
>> > But first please verify that if you run with one MPI rank the "on GPU" and the overall flop rates for the MatMatMult() are almost the same and there is no copy from the GPU for each multiply?
>>
>> Yes, with 1 MPI rank / GPU there are no extra copies done. As soon as I move to 2 ranks I see this behavior.
>>
>> Here are updated logs with a new stage for 2 ranks. I've staged the logs into "MyComputation".
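(For reference, staging the timed loop as Stefano suggested looks roughly like the sketch below. The matrix names `A`, `B`, `C` and the `warmup`/`niters` counts are illustrative placeholders based on the `-warmup 10 -n 20` options in the log, not the actual benchmark driver.)

```c
// Sketch: register a stage, run warmup multiplies outside it, then attribute
// only the steady-state iterations to the "MyComputation" stage in -log_view.
PetscLogStage stage;
PetscLogStageRegister("MyComputation", &stage);

Mat C = NULL;
for (int i = 0; i < warmup; i++) {
  MatMatMult(A, B, i == 0 ? MAT_INITIAL_MATRIX : MAT_REUSE_MATRIX, PETSC_DEFAULT, &C);
}

PetscLogStagePush(stage);
for (int i = 0; i < niters; i++) {
  MatMatMult(A, B, MAT_REUSE_MATRIX, PETSC_DEFAULT, &C);
}
PetscLogStagePop();
```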
>> >> ``` >> ---------------------------------------------- PETSc Performance Summary: >> ---------------------------------------------- >> >> /g/g15/yadav2/taco/petsc/bin/benchmark on a named lassen572 with 2 >> processors, by yadav2 Thu Feb 3 09:27:30 2022 >> Using Petsc Release Version 3.16.3, unknown >> >> Max Max/Min Avg Total >> Time (sec): 2.091e+02 1.001 2.090e+02 >> Objects: 4.800e+01 1.000 4.800e+01 >> Flop: 4.344e+11 1.019 4.303e+11 8.606e+11 >> Flop/sec: 2.077e+09 1.018 2.059e+09 4.118e+09 >> MPI Messages: 3.500e+01 1.000 3.500e+01 7.000e+01 >> MPI Message Lengths: 6.316e+10 1.000 1.805e+09 1.263e+11 >> MPI Reductions: 8.100e+01 1.000 >> >> Flop counting convention: 1 flop = 1 real number operation of type >> (multiply/divide/add/subtract) >> e.g., VecAXPY() for real vectors of length N >> --> 2N flop >> and VecAXPY() for complex vectors of length N >> --> 8N flop >> >> Summary of Stages: ----- Time ------ ----- Flop ------ --- Messages >> --- -- Message Lengths -- -- Reductions -- >> Avg %Total Avg %Total Count >> %Total Avg %Total Count %Total >> 0: Main Stage: 1.0555e+02 50.5% 2.8686e+11 33.3% 3.000e+01 >> 42.9% 1.466e+09 34.8% 4.300e+01 53.1% >> 1: MyComputation: 1.0345e+02 49.5% 5.7373e+11 66.7% 4.000e+01 >> 57.1% 2.058e+09 65.2% 2.000e+01 24.7% >> >> >> ------------------------------------------------------------------------------------------------------------------------ >> See the 'Profiling' chapter of the users' manual for details on >> interpreting output. >> Phase summary info: >> Count: number of times phase was executed >> Time and Flop: Max - maximum over all processors >> Ratio - ratio of maximum to minimum over all processors >> Mess: number of messages sent >> AvgLen: average message length (bytes) >> Reduct: number of global reductions >> Global: entire computation >> Stage: stages of a computation. Set stages with PetscLogStagePush() >> and PetscLogStagePop(). 
>> %T - percent time in this phase %F - percent flop in this >> phase >> %M - percent messages in this phase %L - percent message >> lengths in this phase >> %R - percent reductions in this phase >> Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time >> over all processors) >> GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU >> time over all processors) >> CpuToGpu Count: total number of CPU to GPU copies per processor >> CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per >> processor) >> GpuToCpu Count: total number of GPU to CPU copies per processor >> GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per >> processor) >> GPU %F: percent flops on GPU in this event >> >> ------------------------------------------------------------------------------------------------------------------------ >> Event Count Time (sec) Flop >> --- Global --- --- Stage ---- Total GPU - CpuToGpu - - >> GpuToCpu - GPU >> Max Ratio Max Ratio Max Ratio Mess AvgLen >> Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s Mflop/s Count Size >> Count Size %F >> >> --------------------------------------------------------------------------------------------------------------------------------------------------------------- >> >> --- Event Stage 0: Main Stage >> >> BuildTwoSided 2 1.0 4.0085e-0136.3 0.00e+00 0.0 2.0e+00 4.0e+00 >> 2.0e+00 0 0 3 0 2 0 0 7 0 5 0 0 0 0.00e+00 0 >> 0.00e+00 0 >> BuildTwoSidedF 1 1.0 4.0080e-0113602.0 0.00e+00 0.0 0.0e+00 >> 0.0e+00 1.0e+00 0 0 0 0 1 0 0 0 0 2 0 0 0 >> 0.00e+00 0 0.00e+00 0 >> MatAssemblyBegin 12 1.0 4.0084e-017217.1 0.00e+00 0.0 0.0e+00 >> 0.0e+00 1.0e+00 0 0 0 0 1 0 0 0 0 2 0 0 0 >> 0.00e+00 0 0.00e+00 0 >> MatAssemblyEnd 12 1.0 3.4970e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >> 6.0e+00 2 0 0 0 7 3 0 0 0 14 0 0 0 0.00e+00 0 >> 0.00e+00 0 >> MatZeroEntries 1 1.0 2.4093e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 >> 0.00e+00 0 >> MatLoad 1 1.0 1.3756e+01 1.0 0.00e+00 0.0 6.0e+00 4.6e+08 >> 2.1e+01 7 0 9 2 26 13 0 20 6 49 0 0 0 0.00e+00 0 >> 0.00e+00 0 >> MatMatMultSym 20 1.0 4.7919e+00 2.4 0.00e+00 0.0 4.0e+00 1.6e+07 >> 1.2e+01 2 0 6 0 15 3 0 13 0 28 0 0 0 0.00e+00 0 >> 0.00e+00 0 >> MatMatMultNum 10 1.0 4.9853e+01 1.1 1.45e+11 1.0 2.0e+01 2.1e+09 >> 0.0e+00 23 33 29 33 0 46100 67 94 0 5754 182686 2 2.23e+03 10 >> 2.08e+04 5 >> MatCUSPARSCopyTo 1 1.0 2.2646e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 1 1.55e+02 0 >> 0.00e+00 0 >> MatDenseCopyTo 1 1.0 1.6636e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 1 2.08e+03 0 >> 0.00e+00 0 >> MatDenseCopyFrom 11 1.0 3.0463e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 1 0 0 0 0 3 0 0 0 0 0 0 0 0.00e+00 11 >> 2.29e+04 0 >> VecSet 3 1.0 5.0035e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 >> 0.00e+00 0 >> SFSetGraph 1 1.0 4.4294e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 >> 0.00e+00 0 >> SFSetUp 1 1.0 1.3982e-01 1.0 0.00e+00 0.0 4.0e+00 1.6e+07 >> 1.0e+00 0 0 6 0 1 0 0 13 0 2 0 0 0 0.00e+00 0 >> 0.00e+00 0 >> >> --- Event Stage 1: MyComputation >> >> MatAssemblyBegin 20 1.0 1.6894e-05 2.7 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 >> 0.00e+00 0 >> MatAssemblyEnd 20 1.0 1.5575e-05 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 >> 0.00e+00 0 >> MatMatMultSym 40 1.0 1.0096e+01 2.6 0.00e+00 0.0 0.0e+00 0.0e+00 >> 2.0e+01 3 0 0 0 25 7 0 0 0100 
0 0 0 0.00e+00 0 >> 0.00e+00 0 >> MatMatMultNum 20 1.0 9.9320e+01 1.1 2.90e+11 1.0 4.0e+01 2.1e+09 >> 0.0e+00 46 67 57 65 0 93100100100 0 5777 182577 0 0.00e+00 20 >> 4.16e+04 5 >> MatDenseCopyFrom 20 1.0 5.5380e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >> 0.0e+00 3 0 0 0 0 5 0 0 0 0 0 0 0 0.00e+00 20 >> 4.16e+04 0 >> >> --------------------------------------------------------------------------------------------------------------------------------------------------------------- >> >> Memory usage is given in bytes: >> >> Object Type Creations Destructions Memory Descendants' >> Mem. >> Reports information only for process 0. >> >> --- Event Stage 0: Main Stage >> >> Matrix 17 10 20381695840 0. >> Viewer 2 0 0 0. >> Vector 4 1 1792 0. >> Index Set 2 2 31848152 0. >> Star Forest Graph 3 0 0 0. >> >> --- Event Stage 1: MyComputation >> >> Matrix 20 20 40763391680 0. >> >> ======================================================================================================================== >> Average time to get PetscTime(): 3.96e-08 >> Average time for MPI_Barrier(): 8.184e-07 >> Average time for zero size MPI_Send(): 2.8165e-06 >> #PETSc Option Table entries: >> -bench spmm >> -enable_gpu >> -log_view >> -mat_type aijcusparse >> -matload_block_size 1 >> -matrix /p/gpfs1/yadav2/tensors/petsc/nlpkkt200.petsc >> -n 20 >> -vec_type cuda >> -warmup 10 >> #End of PETSc Option Table entries >> Compiled without FORTRAN kernels >> Compiled with full precision matrices (default) >> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 >> sizeof(PetscScalar) 8 sizeof(PetscInt) 4 >> Configure options: --download-c2html=0 --download-hwloc=0 >> --download-sowing=0 --prefix=./petsc-install/ --with-64-bit-indices=0 >> --with-blaslapack-lib="/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib/liblapack.so >> /usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib/libblas.so" >> --with-cc=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc >> --with-clanguage=C --with-cxx-dialect=C++17 >> --with-cxx=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpig++ >> --with-cuda=1 --with-debugging=0 >> --with-fc=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran >> --with-fftw=0 >> --with-hdf5-dir=/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4 >> --with-hdf5=1 --with-mumps=0 --with-precision=double --with-scalapack=0 >> --with-scalar-type=real --with-shared-libraries=1 --with-ssl=0 >> --with-suitesparse=0 --with-trilinos=0 --with-valgrind=0 --with-x=0 >> --with-zlib-include=/usr/include --with-zlib-lib=/usr/lib64/libz.so >> --with-zlib=1 CFLAGS="-g -DNoChange" COPTFLAGS="-O3" CXXFLAGS="-O3" >> CXXOPTFLAGS="-O3" FFLAGS=-g CUDAFLAGS=-std=c++17 FOPTFLAGS= >> PETSC_ARCH=arch-linux-c-opt >> ----------------------------------------- >> Libraries compiled on 2022-01-21 06:41:50 on lassen111 >> Machine characteristics: >> Linux-4.14.0-115.21.2.1chaos.ch6a.ppc64le-ppc64le-with-redhat-7.6-Maipo >> Using PETSc directory: /g/g15/yadav2/taco/petsc/petsc/petsc-install >> Using PETSc arch: >> ----------------------------------------- >> >> Using C compiler: >> /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc >> -g -DNoChange -fPIC "-O3" >> Using Fortran compiler: >> /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran >> -g -fPIC >> ----------------------------------------- >> >> Using include 
paths: >> -I/g/g15/yadav2/taco/petsc/petsc/petsc-install/include >> -I/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/include >> -I/usr/include -I/usr/tce/packages/cuda/cuda-11.1.0/include >> ----------------------------------------- >> >> Using C linker: >> /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc >> Using Fortran linker: >> /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran >> Using libraries: >> -Wl,-rpath,/g/g15/yadav2/taco/petsc/petsc/petsc-install/lib >> -L/g/g15/yadav2/taco/petsc/petsc/petsc-install/lib -lpetsc >> -Wl,-rpath,/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib >> -L/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib >> -Wl,-rpath,/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/lib >> -L/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/lib >> -Wl,-rpath,/usr/tce/packages/cuda/cuda-11.1.0/lib64 >> -L/usr/tce/packages/cuda/cuda-11.1.0/lib64 >> -Wl,-rpath,/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib >> -L/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib >> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc/ppc64le-redhat-linux/8 >> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc/ppc64le-redhat-linux/8 >> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc >> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc >> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib64 >> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib64 >> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib >> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib -llapack -lblas -lhdf5_hl >> -lhdf5 -lm /usr/lib64/libz.so -lcuda -lcudart -lcufft -lcublas -lcusparse >> -lcusolver -lcurand -lstdc++ -ldl -lmpiprofilesupport -lmpi_ibm_usempi >> -lmpi_ibm_mpifh -lmpi_ibm -lgfortran -lm -lgfortran -lm -lgcc_s -lquadmath >> -lpthread -lquadmath -lstdc++ -ldl >> ----------------------------------------- >> ``` >> >> On Wed, Feb 2, 2022 at 11:59 PM Stefano Zampini < >> [email protected]> wrote: >> >>> >>> >>> 1) It uses MatMPIDenseScatter() to move to the other ranks their needed >>>> rows of the C matrix. That function has the call MatDenseGetArrayRead() >>>> normally would trigger a copy of C up to the CPU each time. But since C is >>>> not changing in your test run I guess it only triggers one copy. >>>> >>>> 2) If uses >>>> MatMatMultNumericAdd_SeqAIJ_SeqDense(aij->B,workB,cdense->A,PETSC_TRUE);CHKERRQ(ierr); >>>> to do the off diagonal part of the product but this triggers for each >>>> multiply a copy of the result matrix from the CPU to the GPU (hugely >>>> expensive) >>>> >>>> For performance there needs to be a new routine >>>> MatMatMultNumeric_MPIAIJCUSPRSE_MPICUDADense() >>>> that is smarter about the needed MPI communication so it only moves exactly >>>> what it needs to the other ranks and it does the off-diagonal part of the >>>> product on the GPU so it does not need to copy the result up to the CPU. 
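(A quick user-level way to see how much of the matrix lives in the off-diagonal block that drives this extra communication and CPU-side work is sketched below; `A` is assumed to be the loaded MPIAIJ/MPIAIJCUSPARSE matrix, and the helper calls are standard PETSc API, not the benchmark code.)

```c
// Sketch: report local nonzeros in the "diagonal" (Ad) and "off-diagonal" (Ao)
// blocks of an MPIAIJ matrix. A dominant Ao is what forces the scatter of the
// dense rows and the off-diagonal product whose result currently bounces
// through the CPU, as described above.
Mat            Ad, Ao;
const PetscInt *colmap;
MatInfo        infod, infoo;
PetscMPIInt    rank;

MPI_Comm_rank(PETSC_COMM_WORLD, &rank);
MatMPIAIJGetSeqAIJ(A, &Ad, &Ao, &colmap);
MatGetInfo(Ad, MAT_LOCAL, &infod);
MatGetInfo(Ao, MAT_LOCAL, &infoo);
PetscPrintf(PETSC_COMM_SELF, "[rank %d] diag nnz %g, off-diag nnz %g\n",
            rank, infod.nz_used, infoo.nz_used);
```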
>>>> >>>> >>> MPIAIJCUSPARSE uses MatProductSetFromOptions_MPIAIJBACKEND >>> >>> Rohan >>> I would suggest to add PetscLogStage around your performance loop (do a >>> warmup outside of it) and send the relevant portion of the log >>> >>> >>>> Barry >>>> >>>> >>>> >>>> >>>> >>>> >>>> ---------------------------------------------- PETSc Performance Summary: >>>> ---------------------------------------------- >>>> >>>> /g/g15/yadav2/taco/petsc/bin/benchmark on a named lassen457 with 2 >>>> processors, by yadav2 Wed Feb 2 17:23:19 2022 >>>> Using Petsc Release Version 3.16.3, unknown >>>> >>>> Max Max/Min Avg Total >>>> Time (sec): 1.163e+02 1.000 1.163e+02 >>>> Objects: 4.800e+01 1.000 4.800e+01 >>>> Flop: 6.338e+11 1.065 6.144e+11 1.229e+12 >>>> Flop/sec: 5.451e+09 1.065 5.284e+09 1.057e+10 >>>> MPI Messages: 3.500e+01 1.000 3.500e+01 7.000e+01 >>>> MPI Message Lengths: 2.544e+09 1.000 7.267e+07 5.087e+09 >>>> MPI Reductions: 8.100e+01 1.000 >>>> >>>> Flop counting convention: 1 flop = 1 real number operation of type >>>> (multiply/divide/add/subtract) >>>> e.g., VecAXPY() for real vectors of length N >>>> --> 2N flop >>>> and VecAXPY() for complex vectors of length N >>>> --> 8N flop >>>> >>>> Summary of Stages: ----- Time ------ ----- Flop ------ --- Messages >>>> --- -- Message Lengths -- -- Reductions -- >>>> Avg %Total Avg %Total Count >>>> %Total Avg %Total Count %Total >>>> 0: Main Stage: 1.1628e+02 100.0% 1.2288e+12 100.0% 7.000e+01 >>>> 100.0% 7.267e+07 100.0% 6.300e+01 77.8% >>>> >>>> ------------------------------------------------------------------------------------------------------------------------ >>>> See the 'Profiling' chapter of the users' manual for details on >>>> interpreting output. >>>> Phase summary info: >>>> Count: number of times phase was executed >>>> Time and Flop: Max - maximum over all processors >>>> Ratio - ratio of maximum to minimum over all processors >>>> Mess: number of messages sent >>>> AvgLen: average message length (bytes) >>>> Reduct: number of global reductions >>>> Global: entire computation >>>> Stage: stages of a computation. Set stages with PetscLogStagePush() and >>>> PetscLogStagePop(). 
>>>> %T - percent time in this phase %F - percent flop in this >>>> phase >>>> %M - percent messages in this phase %L - percent message lengths >>>> in this phase >>>> %R - percent reductions in this phase >>>> Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over >>>> all processors) >>>> GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU >>>> time over all processors) >>>> CpuToGpu Count: total number of CPU to GPU copies per processor >>>> CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per >>>> processor) >>>> GpuToCpu Count: total number of GPU to CPU copies per processor >>>> GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per >>>> processor) >>>> GPU %F: percent flops on GPU in this event >>>> ------------------------------------------------------------------------------------------------------------------------ >>>> Event Count Time (sec) Flop >>>> --- Global --- --- Stage ---- Total GPU - CpuToGpu - - >>>> GpuToCpu - GPU >>>> Max Ratio Max Ratio Max Ratio Mess AvgLen >>>> Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s Mflop/s Count Size >>>> Count Size %F >>>> --------------------------------------------------------------------------------------------------------------------------------------------------------------- >>>> >>>> --- Event Stage 0: Main Stage >>>> >>>> BuildTwoSided 2 1.0 4.4400e-01567.5 0.00e+00 0.0 2.0e+00 4.0e+00 >>>> 2.0e+00 0 0 3 0 2 0 0 3 0 3 0 0 0 0.00e+00 0 >>>> 0.00e+00 0 >>>> BuildTwoSidedF 1 1.0 4.4395e-0115659.1 0.00e+00 0.0 0.0e+00 >>>> 0.0e+00 1.0e+00 0 0 0 0 1 0 0 0 0 2 0 0 0 >>>> 0.00e+00 0 0.00e+00 0 >>>> MatAssemblyBegin 32 1.0 4.4400e-017378.9 0.00e+00 0.0 0.0e+00 0.0e+00 >>>> 1.0e+00 0 0 0 0 1 0 0 0 0 2 0 0 0 0.00e+00 0 >>>> 0.00e+00 0 >>>> MatAssemblyEnd 32 1.0 1.8511e+00 2.2 0.00e+00 0.0 0.0e+00 0.0e+00 >>>> 6.0e+00 1 0 0 0 7 1 0 0 0 10 0 0 0 0.00e+00 0 >>>> 0.00e+00 0 >>>> MatZeroEntries 1 1.0 3.3306e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 >>>> 0.00e+00 0 >>>> MatLoad 1 1.0 1.7220e+01 1.0 0.00e+00 0.0 6.0e+00 -8.8e+07 >>>> 2.1e+01 15 0 9-10 26 15 0 9-10 33 0 0 0 0.00e+00 0 >>>> 0.00e+00 0 >>>> MatMatMultSym 60 1.0 9.2215e-01 2.6 0.00e+00 0.0 4.0e+00 7.3e+05 >>>> 3.2e+01 1 0 6 0 40 1 0 6 0 51 0 0 0 0.00e+00 0 >>>> 0.00e+00 0 >>>> MatMatMultNum 30 1.0 4.2967e+01 1.0 6.34e+11 1.1 6.0e+01 9.4e+07 >>>> 0.0e+00 37100 86110 0 37100 86110 0 28598 920026 2 6.71e+03 30 >>>> 8.73e+04 98 >>>> MatCUSPARSCopyTo 1 1.0 4.4761e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 1 3.80e+03 0 >>>> 0.00e+00 0 >>>> MatDenseCopyTo 1 1.0 2.2742e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 1 2.91e+03 0 >>>> 0.00e+00 0 >>>> MatDenseCopyFrom 31 1.0 1.2006e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>>> 0.0e+00 10 0 0 0 0 10 0 0 0 0 0 0 0 0.00e+00 31 >>>> 9.02e+04 0 >>>> VecSet 3 1.0 4.1917e-04 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 >>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 >>>> 0.00e+00 0 >>>> SFSetGraph 1 1.0 1.9180e-04 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 >>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 >>>> 0.00e+00 0 >>>> SFSetUp 1 1.0 1.3672e-02 1.1 0.00e+00 0.0 4.0e+00 7.3e+05 >>>> 1.0e+00 0 0 6 0 1 0 0 6 0 2 0 0 0 0.00e+00 0 >>>> 0.00e+00 0 >>>> --------------------------------------------------------------------------------------------------------------------------------------------------------------- >>>> >>>> Memory usage is given in bytes: >>>> >>>> Object Type Creations Destructions Memory 
Descendants' Mem. >>>> Reports information only for process 0. >>>> >>>> --- Event Stage 0: Main Stage >>>> >>>> Matrix 37 30 2867511840 0. >>>> Viewer 2 0 0 0. >>>> Vector 4 1 1792 0. >>>> Index Set 2 2 1495248 0. >>>> Star Forest Graph 3 0 0 0. >>>> ======================================================================================================================== >>>> Average time to get PetscTime(): 3.83e-08 >>>> Average time for MPI_Barrier(): 7.874e-07 >>>> Average time for zero size MPI_Send(): 3.4035e-06 >>>> #PETSc Option Table entries: >>>> -bench spmm >>>> -enable_gpu >>>> -log_view >>>> -mat_type aijcusparse >>>> -matload_block_size 1 >>>> -matrix /p/gpfs1/yadav2/tensors/petsc/arabic-2005.petsc >>>> -n 20 >>>> -vec_type cuda >>>> -warmup 10 >>>> ``` >>>> >>>> >>>> Thanks, >>>> >>>> >>>> Rohan Yadav >>>> >>>> >>>> >>>> >>> >>> -- >>> Stefano >>> >> >> >
