Alright, thanks for the help everyone. Rohan
On Thu, Feb 3, 2022 at 2:01 PM Barry Smith <[email protected]> wrote: > > > On Feb 3, 2022, at 4:28 PM, Rohan Yadav <[email protected]> wrote: > > To be concrete, the first matrix was > https://sparse.tamu.edu/LAW/arabic-2005 and the second was > https://sparse.tamu.edu/Schenk/nlpkkt200 (which looks like it does come > from the PDE domain?). > > > You are correct; but the matrix bandwidth is so huge (see how far that > off diagonal that second band of nonzeros is is) for two ranks this means > the each MPI rank ends up needing the entire right hand side matrix to do > the computation. Plus the 0 in the lower diagonal block means that on the > second rank there is essentially no work to be done on the GPU at all. > > > > Regardless of the non-zero structure, there is still a significant hit > when moving from 1 gpu to multiple GPUs that causes a large number of > device to host copies to be performed. If this is a result of the PETSc > implementation thats fine -- but if there's something I can do to work > around that it would be great. > > > I don't understand enough about the code that Stefeno pointed to know > how easy the performance problem would be to fix in PETSc for sparse times > dense matrix product It would still be a problem for the nlpkkt200 > <https://sparse.tamu.edu/Schenk/nlpkkt200> as partitioned on two ranks > no matter what, but in theory the problem can be fixed by improving the > PETSc code. > > I recommend finding a different sparse matrix test case which has an > appropriate nonzero data structuring and partitioning that one can expect > good performance on. > > > > > Rohan > > On Thu, Feb 3, 2022 at 1:25 PM Barry Smith <[email protected]> wrote: > >> >> I suspect the new matrix has a very different parallel nonzero >> structure that results in MOST of the calculations taking place on the CPU >> (since the "off-diagonal" part of the matrix dominates the non-zero >> pattern). PETSc is not designed for this type of nonzero structure and will >> give a bad performance (CPU or GPU); it is not a "PDE-ish" type of nonzero >> structure. >> >> >> >> On Feb 3, 2022, at 2:59 PM, Rohan Yadav <[email protected]> wrote: >> >> I'm sorry, I did a little switch here. The original log view I sent for 2 >> runs was on a different input matrix. Based on Barry's request I switched >> to a different matrix as the original one did not fit on 1 GPU. >> >> > In the previously sent runs it was about 98% on GPU. >> >> Re 98% on the GPU though, my first email had a similar ratio in the log >> though: >> ``` >> >> MatMatMultNum 30 1.0 4.2967e+01 1.0 6.34e+11 1.1 6.0e+01 9.4e+07 >> 0.0e+00 37100 86110 0 37100 86110 0 28598 920026 2 6.71e+03 30 >> 8.73e+04 98 >> >> ``` >> >> The follow up log might be slightly different as well because I pushed a new >> log stage as requested by Stefano. >> >> >> Rohan >> >> >> On Thu, Feb 3, 2022 at 11:50 AM Barry Smith <[email protected]> wrote: >> >>> >>> Mark, >>> >>> Good eye. Something is definitely very different between this run >>> and the previous (options, code change?). In the previously sent runs it >>> was about 98% on GPU. >>> >>> Barry >>> >>> >>> On Feb 3, 2022, at 12:29 PM, Rohan Yadav <[email protected]> wrote: >>> >>> > Please send the code that builds the sparse B matrix and the >>> > setMatToConstant() >>> routine. >>> >>> Setting to a constant: >>> ``` >>> void setMatToConstant(Mat mat, PetscScalar c) { >>> >>> PetscInt rStart, rEnd, m, n; >>> MatGetSize(mat, &m, &n); >>> MatGetOwnershipRange(mat, &rStart, &rEnd); >>> for (int i = rStart; i < rEnd; i++) { >>> for (int j = 0; j < n; j++) { >>> MatSetValue(mat, i, j, c, INSERT_VALUES); >>> } >>> } >>> MatAssemblyBegin(mat, MAT_FINAL_ASSEMBLY); >>> MatAssemblyEnd(mat, MAT_FINAL_ASSEMBLY); >>> } >>> ``` >>> >>> Loading sparse matrix from disk: >>> >>> ``` >>> >>> int loadMatrixFromFile(Mat* A, char* filename) { >>> auto ierr = MatCreate(PETSC_COMM_WORLD, A); CHKERRQ(ierr); >>> MatSetFromOptions(*A); >>> PetscViewer viewer; >>> PetscViewerCreate(PETSC_COMM_WORLD, &viewer); >>> PetscViewerSetType(viewer, PETSCVIEWERBINARY); >>> PetscViewerFileSetMode(viewer, FILE_MODE_READ); >>> PetscViewerFileSetName(viewer, filename); >>> MatLoad(*A, viewer); >>> return 0; >>> } >>> >>> ``` >>> >>> These are only called once and should not affect the computation in a loop >>> though. >>> >>> > But first please verify that if you run with one MPI rank the "on GPU" >>> > and the overall flop rates for the MatMatMult() are almost the same and >>> > there is no copy from the GPU for each multiply? >>> >>> >>> Yes, with 1 mpi rank / GPU there are no extra copies done. As soon as I >>> move to 2 ranks I see this behavior. >>> >>> Here are updated logs with a new stage for 2 ranks. I've staged the logs >>> into "MyComputation". >>> >>> ``` >>> ---------------------------------------------- PETSc Performance >>> Summary: ---------------------------------------------- >>> >>> /g/g15/yadav2/taco/petsc/bin/benchmark on a named lassen572 with 2 >>> processors, by yadav2 Thu Feb 3 09:27:30 2022 >>> Using Petsc Release Version 3.16.3, unknown >>> >>> Max Max/Min Avg Total >>> Time (sec): 2.091e+02 1.001 2.090e+02 >>> Objects: 4.800e+01 1.000 4.800e+01 >>> Flop: 4.344e+11 1.019 4.303e+11 8.606e+11 >>> Flop/sec: 2.077e+09 1.018 2.059e+09 4.118e+09 >>> MPI Messages: 3.500e+01 1.000 3.500e+01 7.000e+01 >>> MPI Message Lengths: 6.316e+10 1.000 1.805e+09 1.263e+11 >>> MPI Reductions: 8.100e+01 1.000 >>> >>> Flop counting convention: 1 flop = 1 real number operation of type >>> (multiply/divide/add/subtract) >>> e.g., VecAXPY() for real vectors of length N >>> --> 2N flop >>> and VecAXPY() for complex vectors of length >>> N --> 8N flop >>> >>> Summary of Stages: ----- Time ------ ----- Flop ------ --- Messages >>> --- -- Message Lengths -- -- Reductions -- >>> Avg %Total Avg %Total Count >>> %Total Avg %Total Count %Total >>> 0: Main Stage: 1.0555e+02 50.5% 2.8686e+11 33.3% 3.000e+01 >>> 42.9% 1.466e+09 34.8% 4.300e+01 53.1% >>> 1: MyComputation: 1.0345e+02 49.5% 5.7373e+11 66.7% 4.000e+01 >>> 57.1% 2.058e+09 65.2% 2.000e+01 24.7% >>> >>> >>> ------------------------------------------------------------------------------------------------------------------------ >>> See the 'Profiling' chapter of the users' manual for details on >>> interpreting output. >>> Phase summary info: >>> Count: number of times phase was executed >>> Time and Flop: Max - maximum over all processors >>> Ratio - ratio of maximum to minimum over all processors >>> Mess: number of messages sent >>> AvgLen: average message length (bytes) >>> Reduct: number of global reductions >>> Global: entire computation >>> Stage: stages of a computation. Set stages with PetscLogStagePush() >>> and PetscLogStagePop(). >>> %T - percent time in this phase %F - percent flop in this >>> phase >>> %M - percent messages in this phase %L - percent message >>> lengths in this phase >>> %R - percent reductions in this phase >>> Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time >>> over all processors) >>> GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max >>> GPU time over all processors) >>> CpuToGpu Count: total number of CPU to GPU copies per processor >>> CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per >>> processor) >>> GpuToCpu Count: total number of GPU to CPU copies per processor >>> GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per >>> processor) >>> GPU %F: percent flops on GPU in this event >>> >>> ------------------------------------------------------------------------------------------------------------------------ >>> Event Count Time (sec) Flop >>> --- Global --- --- Stage ---- Total GPU - CpuToGpu - - >>> GpuToCpu - GPU >>> Max Ratio Max Ratio Max Ratio Mess AvgLen >>> Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s Mflop/s Count Size >>> Count Size %F >>> >>> --------------------------------------------------------------------------------------------------------------------------------------------------------------- >>> >>> --- Event Stage 0: Main Stage >>> >>> BuildTwoSided 2 1.0 4.0085e-0136.3 0.00e+00 0.0 2.0e+00 4.0e+00 >>> 2.0e+00 0 0 3 0 2 0 0 7 0 5 0 0 0 0.00e+00 0 >>> 0.00e+00 0 >>> BuildTwoSidedF 1 1.0 4.0080e-0113602.0 0.00e+00 0.0 0.0e+00 >>> 0.0e+00 1.0e+00 0 0 0 0 1 0 0 0 0 2 0 0 0 >>> 0.00e+00 0 0.00e+00 0 >>> MatAssemblyBegin 12 1.0 4.0084e-017217.1 0.00e+00 0.0 0.0e+00 >>> 0.0e+00 1.0e+00 0 0 0 0 1 0 0 0 0 2 0 0 0 >>> 0.00e+00 0 0.00e+00 0 >>> MatAssemblyEnd 12 1.0 3.4970e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 6.0e+00 2 0 0 0 7 3 0 0 0 14 0 0 0 0.00e+00 0 >>> 0.00e+00 0 >>> MatZeroEntries 1 1.0 2.4093e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 >>> 0.00e+00 0 >>> MatLoad 1 1.0 1.3756e+01 1.0 0.00e+00 0.0 6.0e+00 4.6e+08 >>> 2.1e+01 7 0 9 2 26 13 0 20 6 49 0 0 0 0.00e+00 0 >>> 0.00e+00 0 >>> MatMatMultSym 20 1.0 4.7919e+00 2.4 0.00e+00 0.0 4.0e+00 1.6e+07 >>> 1.2e+01 2 0 6 0 15 3 0 13 0 28 0 0 0 0.00e+00 0 >>> 0.00e+00 0 >>> MatMatMultNum 10 1.0 4.9853e+01 1.1 1.45e+11 1.0 2.0e+01 2.1e+09 >>> 0.0e+00 23 33 29 33 0 46100 67 94 0 5754 182686 2 2.23e+03 10 >>> 2.08e+04 5 >>> MatCUSPARSCopyTo 1 1.0 2.2646e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 1 1.55e+02 0 >>> 0.00e+00 0 >>> MatDenseCopyTo 1 1.0 1.6636e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 1 2.08e+03 0 >>> 0.00e+00 0 >>> MatDenseCopyFrom 11 1.0 3.0463e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 0.0e+00 1 0 0 0 0 3 0 0 0 0 0 0 0 0.00e+00 11 >>> 2.29e+04 0 >>> VecSet 3 1.0 5.0035e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 >>> 0.00e+00 0 >>> SFSetGraph 1 1.0 4.4294e-03 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 >>> 0.00e+00 0 >>> SFSetUp 1 1.0 1.3982e-01 1.0 0.00e+00 0.0 4.0e+00 1.6e+07 >>> 1.0e+00 0 0 6 0 1 0 0 13 0 2 0 0 0 0.00e+00 0 >>> 0.00e+00 0 >>> >>> --- Event Stage 1: MyComputation >>> >>> MatAssemblyBegin 20 1.0 1.6894e-05 2.7 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 >>> 0.00e+00 0 >>> MatAssemblyEnd 20 1.0 1.5575e-05 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 >>> 0.00e+00 0 >>> MatMatMultSym 40 1.0 1.0096e+01 2.6 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 2.0e+01 3 0 0 0 25 7 0 0 0100 0 0 0 0.00e+00 0 >>> 0.00e+00 0 >>> MatMatMultNum 20 1.0 9.9320e+01 1.1 2.90e+11 1.0 4.0e+01 2.1e+09 >>> 0.0e+00 46 67 57 65 0 93100100100 0 5777 182577 0 0.00e+00 20 >>> 4.16e+04 5 >>> MatDenseCopyFrom 20 1.0 5.5380e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>> 0.0e+00 3 0 0 0 0 5 0 0 0 0 0 0 0 0.00e+00 20 >>> 4.16e+04 0 >>> >>> --------------------------------------------------------------------------------------------------------------------------------------------------------------- >>> >>> Memory usage is given in bytes: >>> >>> Object Type Creations Destructions Memory Descendants' >>> Mem. >>> Reports information only for process 0. >>> >>> --- Event Stage 0: Main Stage >>> >>> Matrix 17 10 20381695840 0. >>> Viewer 2 0 0 0. >>> Vector 4 1 1792 0. >>> Index Set 2 2 31848152 0. >>> Star Forest Graph 3 0 0 0. >>> >>> --- Event Stage 1: MyComputation >>> >>> Matrix 20 20 40763391680 0. >>> >>> ======================================================================================================================== >>> Average time to get PetscTime(): 3.96e-08 >>> Average time for MPI_Barrier(): 8.184e-07 >>> Average time for zero size MPI_Send(): 2.8165e-06 >>> #PETSc Option Table entries: >>> -bench spmm >>> -enable_gpu >>> -log_view >>> -mat_type aijcusparse >>> -matload_block_size 1 >>> -matrix /p/gpfs1/yadav2/tensors/petsc/nlpkkt200.petsc >>> -n 20 >>> -vec_type cuda >>> -warmup 10 >>> #End of PETSc Option Table entries >>> Compiled without FORTRAN kernels >>> Compiled with full precision matrices (default) >>> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 >>> sizeof(PetscScalar) 8 sizeof(PetscInt) 4 >>> Configure options: --download-c2html=0 --download-hwloc=0 >>> --download-sowing=0 --prefix=./petsc-install/ --with-64-bit-indices=0 >>> --with-blaslapack-lib="/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib/liblapack.so >>> /usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib/libblas.so" >>> --with-cc=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc >>> --with-clanguage=C --with-cxx-dialect=C++17 >>> --with-cxx=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpig++ >>> --with-cuda=1 --with-debugging=0 >>> --with-fc=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran >>> --with-fftw=0 >>> --with-hdf5-dir=/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4 >>> --with-hdf5=1 --with-mumps=0 --with-precision=double --with-scalapack=0 >>> --with-scalar-type=real --with-shared-libraries=1 --with-ssl=0 >>> --with-suitesparse=0 --with-trilinos=0 --with-valgrind=0 --with-x=0 >>> --with-zlib-include=/usr/include --with-zlib-lib=/usr/lib64/libz.so >>> --with-zlib=1 CFLAGS="-g -DNoChange" COPTFLAGS="-O3" CXXFLAGS="-O3" >>> CXXOPTFLAGS="-O3" FFLAGS=-g CUDAFLAGS=-std=c++17 FOPTFLAGS= >>> PETSC_ARCH=arch-linux-c-opt >>> ----------------------------------------- >>> Libraries compiled on 2022-01-21 06:41:50 on lassen111 >>> Machine characteristics: >>> Linux-4.14.0-115.21.2.1chaos.ch6a.ppc64le-ppc64le-with-redhat-7.6-Maipo >>> Using PETSc directory: /g/g15/yadav2/taco/petsc/petsc/petsc-install >>> Using PETSc arch: >>> ----------------------------------------- >>> >>> Using C compiler: >>> /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc >>> -g -DNoChange -fPIC "-O3" >>> Using Fortran compiler: >>> /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran >>> -g -fPIC >>> ----------------------------------------- >>> >>> Using include paths: >>> -I/g/g15/yadav2/taco/petsc/petsc/petsc-install/include >>> -I/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/include >>> -I/usr/include -I/usr/tce/packages/cuda/cuda-11.1.0/include >>> ----------------------------------------- >>> >>> Using C linker: >>> /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc >>> Using Fortran linker: >>> /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran >>> Using libraries: >>> -Wl,-rpath,/g/g15/yadav2/taco/petsc/petsc/petsc-install/lib >>> -L/g/g15/yadav2/taco/petsc/petsc/petsc-install/lib -lpetsc >>> -Wl,-rpath,/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib >>> -L/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib >>> -Wl,-rpath,/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/lib >>> -L/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/lib >>> -Wl,-rpath,/usr/tce/packages/cuda/cuda-11.1.0/lib64 >>> -L/usr/tce/packages/cuda/cuda-11.1.0/lib64 >>> -Wl,-rpath,/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib >>> -L/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib >>> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc/ppc64le-redhat-linux/8 >>> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc/ppc64le-redhat-linux/8 >>> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc >>> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc >>> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib64 >>> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib64 >>> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib >>> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib -llapack -lblas -lhdf5_hl >>> -lhdf5 -lm /usr/lib64/libz.so -lcuda -lcudart -lcufft -lcublas -lcusparse >>> -lcusolver -lcurand -lstdc++ -ldl -lmpiprofilesupport -lmpi_ibm_usempi >>> -lmpi_ibm_mpifh -lmpi_ibm -lgfortran -lm -lgfortran -lm -lgcc_s -lquadmath >>> -lpthread -lquadmath -lstdc++ -ldl >>> ----------------------------------------- >>> ``` >>> >>> On Wed, Feb 2, 2022 at 11:59 PM Stefano Zampini < >>> [email protected]> wrote: >>> >>>> >>>> >>>> 1) It uses MatMPIDenseScatter() to move to the other ranks their >>>>> needed rows of the C matrix. That function has the call >>>>> MatDenseGetArrayRead() normally would trigger a copy of C up to the CPU >>>>> each time. But since C is not changing in your test run I guess it only >>>>> triggers one copy. >>>>> >>>>> 2) If uses >>>>> MatMatMultNumericAdd_SeqAIJ_SeqDense(aij->B,workB,cdense->A,PETSC_TRUE);CHKERRQ(ierr); >>>>> to do the off diagonal part of the product but this triggers for each >>>>> multiply a copy of the result matrix from the CPU to the GPU (hugely >>>>> expensive) >>>>> >>>>> For performance there needs to be a new routine >>>>> MatMatMultNumeric_MPIAIJCUSPRSE_MPICUDADense() >>>>> that is smarter about the needed MPI communication so it only moves >>>>> exactly >>>>> what it needs to the other ranks and it does the off-diagonal part of the >>>>> product on the GPU so it does not need to copy the result up to the CPU. >>>>> >>>>> >>>> MPIAIJCUSPARSE uses MatProductSetFromOptions_MPIAIJBACKEND >>>> >>>> Rohan >>>> I would suggest to add PetscLogStage around your performance loop (do a >>>> warmup outside of it) and send the relevant portion of the log >>>> >>>> >>>>> Barry >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> ---------------------------------------------- PETSc Performance Summary: >>>>> ---------------------------------------------- >>>>> >>>>> /g/g15/yadav2/taco/petsc/bin/benchmark on a named lassen457 with 2 >>>>> processors, by yadav2 Wed Feb 2 17:23:19 2022 >>>>> Using Petsc Release Version 3.16.3, unknown >>>>> >>>>> Max Max/Min Avg Total >>>>> Time (sec): 1.163e+02 1.000 1.163e+02 >>>>> Objects: 4.800e+01 1.000 4.800e+01 >>>>> Flop: 6.338e+11 1.065 6.144e+11 1.229e+12 >>>>> Flop/sec: 5.451e+09 1.065 5.284e+09 1.057e+10 >>>>> MPI Messages: 3.500e+01 1.000 3.500e+01 7.000e+01 >>>>> MPI Message Lengths: 2.544e+09 1.000 7.267e+07 5.087e+09 >>>>> MPI Reductions: 8.100e+01 1.000 >>>>> >>>>> Flop counting convention: 1 flop = 1 real number operation of type >>>>> (multiply/divide/add/subtract) >>>>> e.g., VecAXPY() for real vectors of length N >>>>> --> 2N flop >>>>> and VecAXPY() for complex vectors of length N >>>>> --> 8N flop >>>>> >>>>> Summary of Stages: ----- Time ------ ----- Flop ------ --- Messages >>>>> --- -- Message Lengths -- -- Reductions -- >>>>> Avg %Total Avg %Total Count >>>>> %Total Avg %Total Count %Total >>>>> 0: Main Stage: 1.1628e+02 100.0% 1.2288e+12 100.0% 7.000e+01 >>>>> 100.0% 7.267e+07 100.0% 6.300e+01 77.8% >>>>> >>>>> ------------------------------------------------------------------------------------------------------------------------ >>>>> See the 'Profiling' chapter of the users' manual for details on >>>>> interpreting output. >>>>> Phase summary info: >>>>> Count: number of times phase was executed >>>>> Time and Flop: Max - maximum over all processors >>>>> Ratio - ratio of maximum to minimum over all processors >>>>> Mess: number of messages sent >>>>> AvgLen: average message length (bytes) >>>>> Reduct: number of global reductions >>>>> Global: entire computation >>>>> Stage: stages of a computation. Set stages with PetscLogStagePush() >>>>> and PetscLogStagePop(). >>>>> %T - percent time in this phase %F - percent flop in this >>>>> phase >>>>> %M - percent messages in this phase %L - percent message >>>>> lengths in this phase >>>>> %R - percent reductions in this phase >>>>> Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time >>>>> over all processors) >>>>> GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU >>>>> time over all processors) >>>>> CpuToGpu Count: total number of CPU to GPU copies per processor >>>>> CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per >>>>> processor) >>>>> GpuToCpu Count: total number of GPU to CPU copies per processor >>>>> GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per >>>>> processor) >>>>> GPU %F: percent flops on GPU in this event >>>>> ------------------------------------------------------------------------------------------------------------------------ >>>>> Event Count Time (sec) Flop >>>>> --- Global --- --- Stage ---- Total GPU - CpuToGpu - - >>>>> GpuToCpu - GPU >>>>> Max Ratio Max Ratio Max Ratio Mess AvgLen >>>>> Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s Mflop/s Count Size >>>>> Count Size %F >>>>> --------------------------------------------------------------------------------------------------------------------------------------------------------------- >>>>> >>>>> --- Event Stage 0: Main Stage >>>>> >>>>> BuildTwoSided 2 1.0 4.4400e-01567.5 0.00e+00 0.0 2.0e+00 4.0e+00 >>>>> 2.0e+00 0 0 3 0 2 0 0 3 0 3 0 0 0 0.00e+00 0 >>>>> 0.00e+00 0 >>>>> BuildTwoSidedF 1 1.0 4.4395e-0115659.1 0.00e+00 0.0 0.0e+00 >>>>> 0.0e+00 1.0e+00 0 0 0 0 1 0 0 0 0 2 0 0 0 >>>>> 0.00e+00 0 0.00e+00 0 >>>>> MatAssemblyBegin 32 1.0 4.4400e-017378.9 0.00e+00 0.0 0.0e+00 >>>>> 0.0e+00 1.0e+00 0 0 0 0 1 0 0 0 0 2 0 0 0 >>>>> 0.00e+00 0 0.00e+00 0 >>>>> MatAssemblyEnd 32 1.0 1.8511e+00 2.2 0.00e+00 0.0 0.0e+00 0.0e+00 >>>>> 6.0e+00 1 0 0 0 7 1 0 0 0 10 0 0 0 0.00e+00 0 >>>>> 0.00e+00 0 >>>>> MatZeroEntries 1 1.0 3.3306e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 >>>>> 0.00e+00 0 >>>>> MatLoad 1 1.0 1.7220e+01 1.0 0.00e+00 0.0 6.0e+00 -8.8e+07 >>>>> 2.1e+01 15 0 9-10 26 15 0 9-10 33 0 0 0 0.00e+00 0 >>>>> 0.00e+00 0 >>>>> MatMatMultSym 60 1.0 9.2215e-01 2.6 0.00e+00 0.0 4.0e+00 7.3e+05 >>>>> 3.2e+01 1 0 6 0 40 1 0 6 0 51 0 0 0 0.00e+00 0 >>>>> 0.00e+00 0 >>>>> MatMatMultNum 30 1.0 4.2967e+01 1.0 6.34e+11 1.1 6.0e+01 9.4e+07 >>>>> 0.0e+00 37100 86110 0 37100 86110 0 28598 920026 2 6.71e+03 >>>>> 30 8.73e+04 98 >>>>> MatCUSPARSCopyTo 1 1.0 4.4761e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 1 3.80e+03 0 >>>>> 0.00e+00 0 >>>>> MatDenseCopyTo 1 1.0 2.2742e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 1 2.91e+03 0 >>>>> 0.00e+00 0 >>>>> MatDenseCopyFrom 31 1.0 1.2006e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>>>> 0.0e+00 10 0 0 0 0 10 0 0 0 0 0 0 0 0.00e+00 31 >>>>> 9.02e+04 0 >>>>> VecSet 3 1.0 4.1917e-04 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 >>>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 >>>>> 0.00e+00 0 >>>>> SFSetGraph 1 1.0 1.9180e-04 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 >>>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 >>>>> 0.00e+00 0 >>>>> SFSetUp 1 1.0 1.3672e-02 1.1 0.00e+00 0.0 4.0e+00 7.3e+05 >>>>> 1.0e+00 0 0 6 0 1 0 0 6 0 2 0 0 0 0.00e+00 0 >>>>> 0.00e+00 0 >>>>> --------------------------------------------------------------------------------------------------------------------------------------------------------------- >>>>> >>>>> Memory usage is given in bytes: >>>>> >>>>> Object Type Creations Destructions Memory Descendants' >>>>> Mem. >>>>> Reports information only for process 0. >>>>> >>>>> --- Event Stage 0: Main Stage >>>>> >>>>> Matrix 37 30 2867511840 0. >>>>> Viewer 2 0 0 0. >>>>> Vector 4 1 1792 0. >>>>> Index Set 2 2 1495248 0. >>>>> Star Forest Graph 3 0 0 0. >>>>> ======================================================================================================================== >>>>> Average time to get PetscTime(): 3.83e-08 >>>>> Average time for MPI_Barrier(): 7.874e-07 >>>>> Average time for zero size MPI_Send(): 3.4035e-06 >>>>> #PETSc Option Table entries: >>>>> -bench spmm >>>>> -enable_gpu >>>>> -log_view >>>>> -mat_type aijcusparse >>>>> -matload_block_size 1 >>>>> -matrix /p/gpfs1/yadav2/tensors/petsc/arabic-2005.petsc >>>>> -n 20 >>>>> -vec_type cuda >>>>> -warmup 10 >>>>> ``` >>>>> >>>>> >>>>> Thanks, >>>>> >>>>> >>>>> Rohan Yadav >>>>> >>>>> >>>>> >>>>> >>>> >>>> -- >>>> Stefano >>>> >>> >>> >> >
