By warm up, I mean running the target computation a few times before starting the timed iterations.
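Concretely, the loop structure is roughly the following (a sketch only; `warmup` and `niter` mirror the -warmup and -n options visible in the log below, A, x, y are assumed to be loaded already, and error checking is omitted):

    PetscLogDouble t0, t1;
    for (PetscInt i = 0; i < warmup; i++)
      MatMult(A, x, y);   /* untimed warm-up calls; the first one should trigger the CPU->GPU copy and cuSPARSE setup */
    PetscTime(&t0);
    for (PetscInt i = 0; i < niter; i++)
      MatMult(A, x, y);   /* timed iterations */
    PetscTime(&t1);

The intent is that the warm-up loop absorbs the one-time transfer and setup costs, so the timed loop measures only the multiply.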
Rohan

On Fri, Jan 14, 2022 at 3:18 PM Mark Adams <mfad...@lbl.gov> wrote:

> If you do a warm up you can probably warm it up with another method (like
> MatTransposeMult) to keep your timing separate. It's not clear to me what
> you mean by "warm up" exactly.
>
> OK, I see you got the GPUs working.
>
> On Fri, Jan 14, 2022 at 4:59 PM Rohan Yadav <roh...@alumni.cmu.edu> wrote:
>
>> A log_view is attached at the end of the mail.
>>
>> I am running on a large problem size (639 million nonzeros).
>>
>> > * I assume you are assembling the matrix on the CPU. The copy of data
>> > to the GPU takes time and you really should be creating the matrix on
>> > the GPU
>>
>> How do I do this? I'm loading the matrix in from a file, but I'm running
>> the computation several times (and with a warmup), so I would expect that
>> the data is copied onto the GPU the first time. My (CPU) code to do this
>> is here:
>> https://github.com/rohany/taco/blob/5c0a4f4419ba392838590ce24e0043f632409e7b/petsc/benchmark.cpp#L68
>>
>> Log view:
>>
>> ---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
>>
>> ./bin/benchmark on a named lassen75 with 2 processors, by yadav2 Fri Jan 14 13:54:09 2022
>> Using Petsc Release Version 3.16.3, unknown
>>
>>                          Max       Max/Min     Avg       Total
>> Time (sec):           1.026e+02     1.000   1.026e+02
>> Objects:              1.200e+01     1.000   1.200e+01
>> Flop:                 1.156e+10     1.009   1.151e+10  2.303e+10
>> Flop/sec:             1.127e+08     1.009   1.122e+08  2.245e+08
>> MPI Messages:         3.500e+01     1.000   3.500e+01  7.000e+01
>> MPI Message Lengths:  2.210e+10     1.000   6.313e+08  4.419e+10
>> MPI Reductions:       4.100e+01     1.000
>>
>> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>>                             e.g., VecAXPY() for real vectors of length N --> 2N flop
>>                             and VecAXPY() for complex vectors of length N --> 8N flop
>>
>> Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
>>                         Avg     %Total     Avg     %Total    Count   %Total     Avg        %Total    Count   %Total
>>  0:      Main Stage: 1.0257e+02 100.0%  2.3025e+10 100.0%  7.000e+01 100.0%  6.313e+08    100.0%  2.300e+01  56.1%
>>
>> ------------------------------------------------------------------------------------------------------------------------
>> See the 'Profiling' chapter of the users' manual for details on interpreting output.
>> Phase summary info:
>>    Count: number of times phase was executed
>>    Time and Flop: Max - maximum over all processors
>>                   Ratio - ratio of maximum to minimum over all processors
>>    Mess: number of messages sent
>>    AvgLen: average message length (bytes)
>>    Reduct: number of global reductions
>>    Global: entire computation
>>    Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>>       %T - percent time in this phase        %F - percent flop in this phase
>>       %M - percent messages in this phase    %L - percent message lengths in this phase
>>       %R - percent reductions in this phase
>>    Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
>>    GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU time over all processors)
>>    CpuToGpu Count: total number of CPU to GPU copies per processor
>>    CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per processor)
>>    GpuToCpu Count: total number of GPU to CPU copies per processor
>>    GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per processor)
>>    GPU %F: percent flops on GPU in this event
>> ------------------------------------------------------------------------------------------------------------------------
>> Event                Count      Time (sec)      Flop                               --- Global ---   --- Stage ----   Total    GPU   - CpuToGpu -   - GpuToCpu -  GPU
>>                    Max Ratio  Max     Ratio    Max  Ratio  Mess   AvgLen  Reduct   %T %F %M %L %R   %T %F %M %L %R  Mflop/s Mflop/s Count   Size   Count   Size  %F
>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>
>> --- Event Stage 0: Main Stage
>>
>> BuildTwoSided          1 1.0 1.8650e-01 3467.8  0.00e+00 0.0  2.0e+00 4.0e+00 1.0e+00   0   0  3  0  2    0   0  3  0  4      0       0      0 0.00e+00    0 0.00e+00  0
>> MatMult               30 1.0 6.6642e+01    1.0  1.16e+10 1.0  6.4e+01 6.4e+08 1.0e+00  65 100 91 93  2   65 100 91 93  4    346       0      0 0.00e+00   31 2.65e+04  0
>> MatAssemblyBegin       1 1.0 3.1100e-07    1.1  0.00e+00 0.0  0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
>> MatAssemblyEnd         1 1.0 1.9798e+01    1.0  0.00e+00 0.0  0.0e+00 0.0e+00 4.0e+00  19   0  0  0 10   19   0  0  0 17      0       0      0 0.00e+00    0 0.00e+00  0
>> MatLoad                1 1.0 3.5519e+01    1.0  0.00e+00 0.0  6.0e+00 5.4e+08 1.6e+01  35   0  9  7 39   35   0  9  7 70      0       0      0 0.00e+00    0 0.00e+00  0
>> VecSet                 5 1.0 5.8959e-02    1.1  0.00e+00 0.0  0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
>> VecScatterBegin       30 1.0 5.4085e+00    1.0  0.00e+00 0.0  6.4e+01 6.4e+08 1.0e+00   5   0 91 93  2    5   0 91 93  4      0       0      0 0.00e+00    0 0.00e+00  0
>> VecScatterEnd         30 1.0 9.2544e+00    2.5  0.00e+00 0.0  0.0e+00 0.0e+00 0.0e+00   6   0  0  0  0    6   0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
>> VecCUDACopyFrom       31 1.0 4.0174e-01    1.0  0.00e+00 0.0  0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0      0       0      0 0.00e+00   31 2.65e+04  0
>> SFSetGraph             1 1.0 4.4912e-02    1.0  0.00e+00 0.0  0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
>> SFSetUp                1 1.0 5.2595e+00    1.0  0.00e+00 0.0  4.0e+00 1.7e+08 1.0e+00   5   0  6  2  2    5   0  6  2  4      0       0      0 0.00e+00    0 0.00e+00  0
>> SFPack                30 1.0 3.4021e-02    1.0  0.00e+00 0.0  0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
>> SFUnpack              30 1.0 1.9222e-05    1.5  0.00e+00 0.0  0.0e+00 0.0e+00 0.0e+00   0   0  0  0  0    0   0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>
>> Memory usage is given in bytes:
>>
>> Object Type          Creations   Destructions     Memory  Descendants' Mem.
>> Reports information only for process 0.
>>
>> --- Event Stage 0: Main Stage
>>
>>               Matrix     3              0            0     0.
>>               Viewer     2              0            0     0.
>>               Vector     4              1         1792     0.
>>            Index Set     2              2    335250404     0.
>>    Star Forest Graph     1              0            0     0.
>>
>> ========================================================================================================================
>> Average time to get PetscTime(): 3.77e-08
>> Average time for MPI_Barrier(): 8.754e-07
>> Average time for zero size MPI_Send(): 2.6755e-06
>> #PETSc Option Table entries:
>> -log_view
>> -mat_type aijcusparse
>> -matrix /p/gpfs1/yadav2/tensors//petsc/kmer_V1r.petsc
>> -n 20
>> -vec_type cuda
>> -warmup 10
>> #End of PETSc Option Table entries
>> Compiled without FORTRAN kernels
>> Compiled with full precision matrices (default)
>> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
>> Configure options: --download-c2html=0 --download-hwloc=0 --download-sowing=0
>> --prefix=./petsc-install/ --with-64-bit-indices=0
>> --with-blaslapack-lib="/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib/liblapack.so /usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib/libblas.so"
>> --with-cc=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc
>> --with-clanguage=C --with-cxx-dialect=C++17
>> --with-cxx=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpig++
>> --with-cuda=1 --with-debugging=0
>> --with-fc=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran
>> --with-fftw=0
>> --with-hdf5-dir=/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4
>> --with-hdf5=1 --with-mumps=0 --with-precision=double --with-scalapack=0
>> --with-scalar-type=real --with-shared-libraries=1 --with-ssl=0
>> --with-suitesparse=0 --with-trilinos=0 --with-valgrind=0 --with-x=0
>> --with-zlib-include=/usr/include --with-zlib-lib=/usr/lib64/libz.so
>> --with-zlib=1 CFLAGS="-g -DNoChange" COPTFLAGS="-O3" CXXFLAGS="-O3"
>> CXXOPTFLAGS="-O3" FFLAGS=-g CUDAFLAGS=-std=c++17 FOPTFLAGS=
>> PETSC_ARCH=arch-linux-c-opt
>> -----------------------------------------
>> Libraries compiled on 2022-01-14 20:56:04 on lassen99
>> Machine characteristics: Linux-4.14.0-115.21.2.1chaos.ch6a.ppc64le-ppc64le-with-redhat-7.6-Maipo
>> Using PETSc directory: /g/g15/yadav2/taco/petsc/petsc/petsc-install
>> Using PETSc arch:
>> -----------------------------------------
>>
>> Using C compiler: /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc -g -DNoChange -fPIC "-O3"
>> Using Fortran compiler: /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran -g -fPIC
>> -----------------------------------------
>>
>> Using include paths: -I/g/g15/yadav2/taco/petsc/petsc/petsc-install/include
>> -I/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/include
>> -I/usr/include -I/usr/tce/packages/cuda/cuda-11.1.0/include
>> -----------------------------------------
>>
>> Using C linker: /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc
>> Using Fortran linker: /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran
>> Using libraries: -Wl,-rpath,/g/g15/yadav2/taco/petsc/petsc/petsc-install/lib
>> -L/g/g15/yadav2/taco/petsc/petsc/petsc-install/lib -lpetsc
>> -Wl,-rpath,/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib
>> -L/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib
>> -Wl,-rpath,/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/lib
>> -L/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/lib
>> -Wl,-rpath,/usr/tce/packages/cuda/cuda-11.1.0/lib64
>> -L/usr/tce/packages/cuda/cuda-11.1.0/lib64
>> -Wl,-rpath,/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib
>> -L/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib
>> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc/ppc64le-redhat-linux/8
>> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc/ppc64le-redhat-linux/8
>> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc
>> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc
>> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib64
>> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib64
>> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib
>> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib -llapack -lblas -lhdf5_hl
>> -lhdf5 -lm /usr/lib64/libz.so -lcuda -lcudart -lcufft -lcublas -lcusparse
>> -lcusolver -lcurand -lstdc++ -ldl -lmpiprofilesupport -lmpi_ibm_usempi
>> -lmpi_ibm_mpifh -lmpi_ibm -lgfortran -lm -lgfortran -lm -lgcc_s -lquadmath
>> -lpthread -lquadmath -lstdc++ -ldl
>> -----------------------------------------
>>
>> On Fri, Jan 14, 2022 at 1:43 PM Mark Adams <mfad...@lbl.gov> wrote:
>>
>>> There are a few things:
>>> * GPUs have higher latencies, so you basically need a large enough
>>> problem to get a GPU speedup.
>>> * I assume you are assembling the matrix on the CPU. The copy of data to
>>> the GPU takes time and you really should be creating the matrix on the GPU.
>>> * I agree with Barry: roughly 1M / GPU is around where you start seeing
>>> a win, but this depends on a lot of things.
>>> * There are startup costs, like the CPU-GPU copy. It is best to run one
>>> mat-vec, or whatever, push a new stage, and then run the benchmark. The
>>> timing for this new stage will be separate in the log view data. Look at
>>> that.
>>>   - You can fake this by running your benchmark many times to amortize
>>> any setup costs.
>>>
>>> On Fri, Jan 14, 2022 at 4:27 PM Rohan Yadav <roh...@alumni.cmu.edu> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm looking to use PETSc with GPUs to do some linear algebra operations,
>>>> like SpMV, SpMM, etc. Building PETSc with `--with-cuda=1` and running
>>>> with `-mat_type aijcusparse -vec_type cuda` gives me a large slowdown
>>>> compared to the same code running on the CPU. This is not entirely
>>>> unexpected, as things like data transfer costs across PCIe might
>>>> erroneously be included in my timing. Are there some examples of
>>>> benchmarking GPU computations with PETSc, or just the proper way to
>>>> write code in PETSc that will work for both CPUs and GPUs?
>>>>
>>>> Rohan
>>>
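For reference, a minimal sketch of the staged-timing pattern Mark describes above (stage name, `niter`, and A/x/y are illustrative and assumed to be set up already; error checking omitted):

    PetscLogStage bench;
    PetscLogStageRegister("SpMV benchmark", &bench);
    MatMult(A, x, y);                 /* one untimed mat-vec to absorb the CPU->GPU copy and other setup */
    PetscLogStagePush(bench);
    for (PetscInt i = 0; i < niter; i++)
      MatMult(A, x, y);               /* only these iterations are attributed to the new stage */
    PetscLogStagePop();

With this, -log_view reports the "SpMV benchmark" stage separately from the Main Stage, so MatLoad and the initial transfer no longer dominate the numbers of interest.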