> On Jan 20, 2022, at 5:24 PM, Rohan Yadav <roh...@alumni.cmu.edu> wrote:
>
> Thanks Barry, this is what I was looking for. However, it doesn't seem to be
> working for me (the reported times are still significantly different with
> -log_view on and off).
I think this is because, without -log_view, your second loop overlaps the
additional kernel launches with the GPU computation, but with -log_view it does
not (since -log_view forces each MatMult to complete before the next one can be
launched by the CPU). If you put the PetscLogGpuTimeBegin/End inside the loop,
then -log_view should have much less effect. But I am not sure exactly what
will happen with them inside the loop together with -log_view, since there will
be "extra" PetscLogGpuTimeEnd synchronization points; I don't think they will
matter, but I cannot say for sure. Like I said, tricky.

> Here is my exact timing code:
> ```
> double avgTime = 0.0;
> {
>   PetscLogDouble start, end;
>   PetscLogGpuTimeBegin();
>   for (int i = 0; i < warmup; i++) {
>     MatMult(A, x, y);
>   }
>   PetscLogGpuTimeEnd();
>   PetscLogGpuTimeBegin();
>   PetscTime(&start);
>   for (int i = 0; i < niter; i++) {
>     MatMult(A, x, y);
>   }
>   PetscLogGpuTimeEnd();
>   PetscTime(&end);
>   auto sec = end - start;
>   avgTime = double(sec) / double(niter);
> }
> ```
> I'm measuring the time for a group of MatMults as you suggested (with some
> warmup iterations).
>
> Rohan
>
> On Thu, Jan 20, 2022 at 1:42 PM Barry Smith <bsm...@petsc.dev> wrote:
>
> Some operations on the GPU are asynchronous: the CPU passes the kernel launch
> to the GPU and then immediately returns, ready to do something else before the
> kernel is completed (or even started). Others, like VecDot(), where the result
> is stored in CPU memory, have to block until the kernel is complete and the
> result has been copied up to the CPU.
>
> -log_view forces the calls to PetscLogGpuTimeEnd(), which has (for CUDA)
>
>   cerr = cudaEventRecord(petsc_gputimer_end,PetscDefaultCudaStream);CHKERRCUDA(cerr);
>   cerr = cudaEventSynchronize(petsc_gputimer_end);CHKERRCUDA(cerr);
>   cerr = cudaEventElapsedTime(&gtime,petsc_gputimer_begin,petsc_gputimer_end);CHKERRCUDA(cerr);
>   petsc_gtime += (PetscLogDouble)gtime/1000.0; /* convert milliseconds to seconds */
>
> which essentially causes the CPU to wait until the kernel is complete; hence
> your time with -log_view captures the full time to run the kernel.
>
> So timing with GPUs can be a tricky business (when do you want to block and
> when do you not?). For your loop, you may want to use
>
>   PetscLogGpuTimeBegin()
>   start = now()
>   for (int i = 0; i < 10; i++) {
>     MatMult(A, x, y);
>   }
>   PetscLogGpuTimeEnd()
>   end = now()
>   print((end - start) / 10)
>
> Now, after the loop, it will wait until all the multiplies are completely
> done, giving a better view of the time it takes. If you did
>
>   start = now()
>   for (int i = 0; i < 10; i++) {
>     PetscLogGpuTimeBegin()
>     MatMult(A, x, y);
>     PetscLogGpuTimeEnd()
>   }
>   end = now()
>   print((end - start) / 10)
>
> you would wait a longer time, because the CPU could not tell the GPU about the
> second kernel launch until the first kernel is completely done. Hence there
> would be no overlap of GPU computation and CPU kernel launches (which take a
> long time).
>
> IMHO timing individual operations like a single MatMult() on GPUs has only a
> certain level of usefulness, since you slow down the computation (by removing
> the asynchronous nature between the GPU and CPU) in order to get accurate
> times. It is better to time something like a complete linear solve, nonlinear
> solve, etc., and not log at a finer granularity.
>
> Barry
>
>> On Jan 20, 2022, at 4:07 PM, Rohan Yadav <roh...@alumni.cmu.edu> wrote:
>>
>> Another small question -- I'm a little confused about timing GPU codes with
>> PETSc. I have a code that looks like:
>> ```
>> start = now()
>> for (int i = 0; i < 10; i++) {
>>   MatMult(A, x, y);
>> }
>> end = now()
>> print((end - start) / 10)
>> ```
>>
>> If I run this program with `-vec_type cuda -mat_type aijcusparse`, the GPUs
>> are indeed utilized, but the recorded time is very tiny (I imagine just
>> tracking the cost of launching CUDA kernels). However, if I add `-log_view`
>> to the command line arguments, then the resulting time printed matches what
>> is recorded by `nvprof`. What is the correct way to benchmark PETSc with
>> GPUs without having -log_view turned on?
>>
>> Thanks,
>>
>> Rohan
>>
>> On Sat, Jan 15, 2022 at 7:37 AM Barry Smith <bsm...@petsc.dev> wrote:
>>
>> Oh yes, you are correct for this operation, since handling different nonzero
>> patterns is not trivial to implement well for the GPU.
>>
>>> On Jan 15, 2022, at 1:17 AM, Rohan Yadav <roh...@alumni.cmu.edu> wrote:
>>>
>>> Scanning the source code for mpiseqaijcusparse confirms my thoughts -- when
>>> used with DIFFERENT_NONZERO_PATTERN, it falls back to calling
>>> MatAXPY_SeqAIJ, copying the data back over to the host.
>>>
>>> Rohan
>>>
>>> On Fri, Jan 14, 2022 at 10:16 PM Rohan Yadav <roh...@alumni.cmu.edu> wrote:
>>>
>>> ---------- Forwarded message ---------
>>> From: Rohan Yadav <roh...@alumni.cmu.edu>
>>> Date: Fri, Jan 14, 2022 at 10:03 PM
>>> Subject: Re: [petsc-dev] Using PETSC with GPUs
>>> To: Barry Smith <bsm...@petsc.dev>
>>>
>>> Ok, I'll try looking with greps like that and see what I find.
>>>
>>> > My guess why your code is not using the seqaijcusparse is that you are
>>> > not setting the type before you call MatLoad(), hence it loads as
>>> > SeqAIJ. -mat_type does not magically change a type once a matrix has a
>>> > set type. I agree our documentation on how to make objects be GPU
>>> > objects is horrible now.
>>>
>>> I printed out my matrices with the PetscViewer objects and can confirm that
>>> the type is seqaijcusparse. Perhaps for the way I'm using it
>>> (DIFFERENT_NONZERO_PATTERN) the kernel is unsupported? I'm not sure how to
>>> get any more diagnostic info about why the CUDA kernel isn't called...
>>>
>>> Rohan
>>>
>>> On Fri, Jan 14, 2022 at 9:46 PM Barry Smith <bsm...@petsc.dev> wrote:
>>>
>>> This changes rapidly and depends on whether the backend is CUDA, HIP, SYCL,
>>> or Kokkos. The only way to find out definitively is with, for example,
>>>
>>>   git grep MatMult_ | egrep -i "(cusparse|cublas|cuda)"
>>>
>>> Because of our, unfortunately, earlier naming choices, you need to kind of
>>> know what to grep for; for CUDA it may be cuSparse or cuBLAS.
>>>
>>> Not-yet-merged branches may also have some operations that are still
>>> being developed.
>>>
>>> My guess why your code is not using the seqaijcusparse is that you are
>>> not setting the type before you call MatLoad(), hence it loads as SeqAIJ.
>>> -mat_type does not magically change a type once a matrix has a set type. I
>>> agree our documentation on how to make objects be GPU objects is horrible
>>> now.
>>>
>>> Barry
>>>
>>>> On Jan 15, 2022, at 12:31 AM, Rohan Yadav <roh...@alumni.cmu.edu> wrote:
>>>>
>>>> I was wondering if there is a definitive list of which operations are and
>>>> aren't supported for distributed GPU execution. For some operations, like
>>>> `MatMult`, it is clear from the documentation that MPIAIJCUSPARSE
>>>> implements MatMult, but for other operations, such as MatMatMult, it is
>>>> unclear.
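The call order Barry describes (type set, or options applied, before the load) can be sketched as the following fragment. This is a hedged sketch, not a complete program: the viewer filename is hypothetical, and the usual PETSc error checking (PetscCall/CHKERRQ) is omitted for brevity.

```c
Mat         A;
PetscViewer viewer;

MatCreate(PETSC_COMM_WORLD, &A);
MatSetType(A, MATAIJCUSPARSE);      /* fix the GPU type BEFORE MatLoad()... */
MatSetFromOptions(A);               /* ...or let -mat_type take effect here  */
PetscViewerBinaryOpen(PETSC_COMM_WORLD, "matrix.petsc", FILE_MODE_READ, &viewer);
MatLoad(A, viewer);                 /* now loads directly as a GPU matrix    */
PetscViewerDestroy(&viewer);
```

If the type is only set (or -mat_type only consulted) after MatLoad(), the matrix has already been created as plain AIJ and the option cannot retroactively change it, which matches the behavior Barry guesses at above.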
>>>> Another scenario is the MatAXPY kernel, which supposedly has a
>>>> SeqAIJCUSPARSE implementation, which I take to mean that it can only
>>>> execute on a single GPU. However, even if I pass -mat_type seqaijcusparse
>>>> to the kernel, it doesn't seem to utilize the GPU.
>>>>
>>>> Rohan
>>>>
>>>> On Fri, Jan 14, 2022 at 4:05 PM Barry Smith <bsm...@petsc.dev> wrote:
>>>>
>>>> Just use 1 MPI rank.
>>>>
>>>> ------------------------------------------------------------------------------------------------------------------------
>>>> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
>>>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
>>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>>
>>>> --- Event Stage 0: Main Stage
>>>>
>>>> BuildTwoSided          1 1.0 1.8650e-013467.8 0.00e+00 0.0 2.0e+00 4.0e+00 1.0e+00  0  0  3  0  2   0  0  3  0  4     0       0      0 0.00e+00    0 0.00e+00  0
>>>> MatMult               30 1.0 6.6642e+01 1.0 1.16e+10 1.0 6.4e+01 6.4e+08 1.0e+00 65100 91 93  2  65100 91 93  4   346       0      0 0.00e+00   31 2.65e+04  0
>>>>
>>>> From this it is clear the matrix never ended up on the GPU, but the vector
>>>> did. For each multiply, it is copying the vector from the GPU to the CPU
>>>> and then doing the MatMult on the CPU. If the MatMult were done on the
>>>> GPU, the final number in the row (GPU %F) would be 100, indicating all the
>>>> flops were done on the GPU, and the fifth-from-the-end value of 0 would be
>>>> some large number: the flop rate on the GPU.
>>>>
>>>>> On Jan 14, 2022, at 4:59 PM, Rohan Yadav <roh...@alumni.cmu.edu> wrote:
>>>>>
>>>>> A log_view is attached at the end of the mail.
>>>>> >>>>> I am running on a large problem size (639 million nonzeros). >>>>> >>>>> > * I assume you are assembling the matrix on the CPU. The copy of data >>>>> > to the GPU takes time and you really should be creating the matrix on >>>>> > the GPU >>>>> >>>>> How do I do this? I'm loading the matrix in from a file, but I'm running >>>>> the computation several times (and with a warmup), so I would expect that >>>>> the data is copied onto the GPU the first time. My (cpu) code to do this >>>>> is here: >>>>> https://github.com/rohany/taco/blob/5c0a4f4419ba392838590ce24e0043f632409e7b/petsc/benchmark.cpp#L68 >>>>> >>>>> <https://github.com/rohany/taco/blob/5c0a4f4419ba392838590ce24e0043f632409e7b/petsc/benchmark.cpp#L68>. >>>>> >>>>> Log view: >>>>> >>>>> ---------------------------------------------- PETSc Performance Summary: >>>>> ---------------------------------------------- >>>>> >>>>> ./bin/benchmark on a named lassen75 with 2 processors, by yadav2 Fri Jan >>>>> 14 13:54:09 2022 >>>>> Using Petsc Release Version 3.16.3, unknown >>>>> >>>>> Max Max/Min Avg Total >>>>> Time (sec): 1.026e+02 1.000 1.026e+02 >>>>> Objects: 1.200e+01 1.000 1.200e+01 >>>>> Flop: 1.156e+10 1.009 1.151e+10 2.303e+10 >>>>> Flop/sec: 1.127e+08 1.009 1.122e+08 2.245e+08 >>>>> MPI Messages: 3.500e+01 1.000 3.500e+01 7.000e+01 >>>>> MPI Message Lengths: 2.210e+10 1.000 6.313e+08 4.419e+10 >>>>> MPI Reductions: 4.100e+01 1.000 >>>>> >>>>> Flop counting convention: 1 flop = 1 real number operation of type >>>>> (multiply/divide/add/subtract) >>>>> e.g., VecAXPY() for real vectors of length N >>>>> --> 2N flop >>>>> and VecAXPY() for complex vectors of length N >>>>> --> 8N flop >>>>> >>>>> Summary of Stages: ----- Time ------ ----- Flop ------ --- Messages >>>>> --- -- Message Lengths -- -- Reductions -- >>>>> Avg %Total Avg %Total Count >>>>> %Total Avg %Total Count %Total >>>>> 0: Main Stage: 1.0257e+02 100.0% 2.3025e+10 100.0% 7.000e+01 >>>>> 100.0% 6.313e+08 100.0% 2.300e+01 56.1% 
>>>>> >>>>> ------------------------------------------------------------------------------------------------------------------------ >>>>> See the 'Profiling' chapter of the users' manual for details on >>>>> interpreting output. >>>>> Phase summary info: >>>>> Count: number of times phase was executed >>>>> Time and Flop: Max - maximum over all processors >>>>> Ratio - ratio of maximum to minimum over all processors >>>>> Mess: number of messages sent >>>>> AvgLen: average message length (bytes) >>>>> Reduct: number of global reductions >>>>> Global: entire computation >>>>> Stage: stages of a computation. Set stages with PetscLogStagePush() >>>>> and PetscLogStagePop(). >>>>> %T - percent time in this phase %F - percent flop in this >>>>> phase >>>>> %M - percent messages in this phase %L - percent message >>>>> lengths in this phase >>>>> %R - percent reductions in this phase >>>>> Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time >>>>> over all processors) >>>>> GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU >>>>> time over all processors) >>>>> CpuToGpu Count: total number of CPU to GPU copies per processor >>>>> CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per >>>>> processor) >>>>> GpuToCpu Count: total number of GPU to CPU copies per processor >>>>> GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per >>>>> processor) >>>>> GPU %F: percent flops on GPU in this event >>>>> ------------------------------------------------------------------------------------------------------------------------ >>>>> Event Count Time (sec) Flop >>>>> --- Global --- --- Stage ---- Total GPU - CpuToGpu - - >>>>> GpuToCpu - GPU >>>>> Max Ratio Max Ratio Max Ratio Mess AvgLen >>>>> Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s Mflop/s Count Size >>>>> Count Size %F >>>>> 
--------------------------------------------------------------------------------------------------------------------------------------------------------------- >>>>> >>>>> --- Event Stage 0: Main Stage >>>>> >>>>> BuildTwoSided 1 1.0 1.8650e-013467.8 0.00e+00 0.0 2.0e+00 >>>>> 4.0e+00 1.0e+00 0 0 3 0 2 0 0 3 0 4 0 0 0 >>>>> 0.00e+00 0 0.00e+00 0 >>>>> MatMult 30 1.0 6.6642e+01 1.0 1.16e+10 1.0 6.4e+01 6.4e+08 >>>>> 1.0e+00 65100 91 93 2 65100 91 93 4 346 0 0 0.00e+00 31 >>>>> 2.65e+04 0 >>>>> MatAssemblyBegin 1 1.0 3.1100e-07 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 >>>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 >>>>> 0.00e+00 0 >>>>> MatAssemblyEnd 1 1.0 1.9798e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>>>> 4.0e+00 19 0 0 0 10 19 0 0 0 17 0 0 0 0.00e+00 0 >>>>> 0.00e+00 0 >>>>> MatLoad 1 1.0 3.5519e+01 1.0 0.00e+00 0.0 6.0e+00 5.4e+08 >>>>> 1.6e+01 35 0 9 7 39 35 0 9 7 70 0 0 0 0.00e+00 0 >>>>> 0.00e+00 0 >>>>> VecSet 5 1.0 5.8959e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 >>>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 >>>>> 0.00e+00 0 >>>>> VecScatterBegin 30 1.0 5.4085e+00 1.0 0.00e+00 0.0 6.4e+01 6.4e+08 >>>>> 1.0e+00 5 0 91 93 2 5 0 91 93 4 0 0 0 0.00e+00 0 >>>>> 0.00e+00 0 >>>>> VecScatterEnd 30 1.0 9.2544e+00 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 >>>>> 0.0e+00 6 0 0 0 0 6 0 0 0 0 0 0 0 0.00e+00 0 >>>>> 0.00e+00 0 >>>>> VecCUDACopyFrom 31 1.0 4.0174e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 31 >>>>> 2.65e+04 0 >>>>> SFSetGraph 1 1.0 4.4912e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 >>>>> 0.00e+00 0 >>>>> SFSetUp 1 1.0 5.2595e+00 1.0 0.00e+00 0.0 4.0e+00 1.7e+08 >>>>> 1.0e+00 5 0 6 2 2 5 0 6 2 4 0 0 0 0.00e+00 0 >>>>> 0.00e+00 0 >>>>> SFPack 30 1.0 3.4021e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 >>>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 >>>>> 0.00e+00 0 >>>>> SFUnpack 30 1.0 1.9222e-05 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 >>>>> 0.0e+00 0 0 0 0 0 0 0 0 0 0 0 0 0 0.00e+00 0 >>>>> 0.00e+00 
0 >>>>> --------------------------------------------------------------------------------------------------------------------------------------------------------------- >>>>> >>>>> Memory usage is given in bytes: >>>>> >>>>> Object Type Creations Destructions Memory Descendants' >>>>> Mem. >>>>> Reports information only for process 0. >>>>> >>>>> --- Event Stage 0: Main Stage >>>>> >>>>> Matrix 3 0 0 0. >>>>> Viewer 2 0 0 0. >>>>> Vector 4 1 1792 0. >>>>> Index Set 2 2 335250404 0. >>>>> Star Forest Graph 1 0 0 0. >>>>> ======================================================================================================================== >>>>> Average time to get PetscTime(): 3.77e-08 >>>>> Average time for MPI_Barrier(): 8.754e-07 >>>>> Average time for zero size MPI_Send(): 2.6755e-06 >>>>> #PETSc Option Table entries: >>>>> -log_view >>>>> -mat_type aijcusparse >>>>> -matrix /p/gpfs1/yadav2/tensors//petsc/kmer_V1r.petsc >>>>> -n 20 >>>>> -vec_type cuda >>>>> -warmup 10 >>>>> #End of PETSc Option Table entries >>>>> Compiled without FORTRAN kernels >>>>> Compiled with full precision matrices (default) >>>>> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 >>>>> sizeof(PetscScalar) 8 sizeof(PetscInt) 4 >>>>> Configure options: --download-c2html=0 --download-hwloc=0 >>>>> --download-sowing=0 --prefix=./petsc-install/ --with-64-bit-indices=0 >>>>> --with-blaslapack-lib="/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib/liblapack.so >>>>> /usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib/libblas.so" >>>>> --with-cc=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc >>>>> --with-clanguage=C --with-cxx-dialect=C++17 >>>>> --with-cxx=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpig++ >>>>> --with-cuda=1 --with-debugging=0 >>>>> --with-fc=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran >>>>> --with-fftw=0 >>>>> 
--with-hdf5-dir=/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4 >>>>> --with-hdf5=1 --with-mumps=0 --with-precision=double --with-scalapack=0 >>>>> --with-scalar-type=real --with-shared-libraries=1 --with-ssl=0 >>>>> --with-suitesparse=0 --with-trilinos=0 --with-valgrind=0 --with-x=0 >>>>> --with-zlib-include=/usr/include --with-zlib-lib=/usr/lib64/libz.so >>>>> --with-zlib=1 CFLAGS="-g -DNoChange" COPTFLAGS="-O3" CXXFLAGS="-O3" >>>>> CXXOPTFLAGS="-O3" FFLAGS=-g CUDAFLAGS=-std=c++17 FOPTFLAGS= >>>>> PETSC_ARCH=arch-linux-c-opt >>>>> ----------------------------------------- >>>>> Libraries compiled on 2022-01-14 20:56:04 on lassen99 >>>>> Machine characteristics: >>>>> Linux-4.14.0-115.21.2.1chaos.ch6a.ppc64le-ppc64le-with-redhat-7.6-Maipo >>>>> Using PETSc directory: /g/g15/yadav2/taco/petsc/petsc/petsc-install >>>>> Using PETSc arch: >>>>> ----------------------------------------- >>>>> >>>>> Using C compiler: >>>>> /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc >>>>> -g -DNoChange -fPIC "-O3" >>>>> Using Fortran compiler: >>>>> /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran >>>>> -g -fPIC >>>>> ----------------------------------------- >>>>> >>>>> Using include paths: >>>>> -I/g/g15/yadav2/taco/petsc/petsc/petsc-install/include >>>>> -I/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/include >>>>> -I/usr/include -I/usr/tce/packages/cuda/cuda-11.1.0/include >>>>> ----------------------------------------- >>>>> >>>>> Using C linker: >>>>> /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc >>>>> Using Fortran linker: >>>>> /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran >>>>> Using libraries: >>>>> -Wl,-rpath,/g/g15/yadav2/taco/petsc/petsc/petsc-install/lib >>>>> 
-L/g/g15/yadav2/taco/petsc/petsc/petsc-install/lib -lpetsc >>>>> -Wl,-rpath,/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib >>>>> -L/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib >>>>> -Wl,-rpath,/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/lib >>>>> >>>>> -L/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/lib >>>>> -Wl,-rpath,/usr/tce/packages/cuda/cuda-11.1.0/lib64 >>>>> -L/usr/tce/packages/cuda/cuda-11.1.0/lib64 >>>>> -Wl,-rpath,/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib >>>>> -L/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib >>>>> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc/ppc64le-redhat-linux/8 >>>>> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc/ppc64le-redhat-linux/8 >>>>> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc >>>>> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc >>>>> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib64 >>>>> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib64 >>>>> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib >>>>> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib -llapack -lblas -lhdf5_hl >>>>> -lhdf5 -lm /usr/lib64/libz.so -lcuda -lcudart -lcufft -lcublas -lcusparse >>>>> -lcusolver -lcurand -lstdc++ -ldl -lmpiprofilesupport -lmpi_ibm_usempi >>>>> -lmpi_ibm_mpifh -lmpi_ibm -lgfortran -lm -lgfortran -lm -lgcc_s >>>>> -lquadmath -lpthread -lquadmath -lstdc++ -ldl >>>>> ----------------------------------------- >>>>> >>>>> On Fri, Jan 14, 2022 at 1:43 PM Mark Adams <mfad...@lbl.gov >>>>> <mailto:mfad...@lbl.gov>> wrote: >>>>> There are a few things: >>>>> * GPU have higher latencies and so you basically need a large enough >>>>> problem to get GPU speedup >>>>> * I assume you are assembling the matrix on the CPU. 
The copy of data to the GPU takes time, and you really should be creating the
matrix on the GPU.
>>>>> * I agree with Barry; roughly 1M / GPU is around where you start seeing
>>>>> a win, but this depends on a lot of things.
>>>>> * There are startup costs, like the CPU-GPU copy. It is best to run one
>>>>> mat-vec, or whatever, push a new stage, and then run the benchmark. The
>>>>> timing for this new stage will be separate in the log view data. Look at
>>>>> that.
>>>>>   - You can fake this by running your benchmark many times to amortize
>>>>>     any setup costs.
>>>>>
>>>>> On Fri, Jan 14, 2022 at 4:27 PM Rohan Yadav <roh...@alumni.cmu.edu> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I'm looking to use PETSc with GPUs to do some linear algebra operations,
>>>>> like SpMV, SpMM, etc. Building PETSc with `--with-cuda=1` and running
>>>>> with `-mat_type aijcusparse -vec_type cuda` gives me a large slowdown
>>>>> relative to the same code running on the CPU. This is not entirely
>>>>> unexpected, as things like data transfer costs across PCIe might
>>>>> erroneously be included in my timing. Are there some examples of
>>>>> benchmarking GPU computations with PETSc, or just the proper way to
>>>>> write code in PETSc that will work for CPUs and GPUs?
>>>>>
>>>>> Rohan
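Mark's advice (warm up first, then time a separate stage so one-time costs are amortized) maps onto any benchmark harness. A minimal sketch in Python, where `op` is a hypothetical stand-in for a MatMult or a full solve (this is not PETSc code):

```python
import time

def benchmark(op, warmup=10, niter=20):
    """Run op() `warmup` times untimed (paying one-time CPU->GPU copies,
    allocations, etc.), then return the average time over `niter` runs."""
    for _ in range(warmup):
        op()  # startup costs land here, outside the timed region
    start = time.perf_counter()
    for _ in range(niter):
        op()
    # With a real GPU backend, synchronize the device here (the analogue of
    # PetscLogGpuTimeEnd()) before reading the clock, or the loop above only
    # measures launch overhead.
    return (time.perf_counter() - start) / niter

if __name__ == "__main__":
    avg = benchmark(lambda: sum(range(10000)))
    print(f"average per-iteration time: {avg * 1e6:.1f} us")
```

The warmup loop is the "fake stage" Mark mentions; in PETSc proper, the same separation is done with PetscLogStagePush/PetscLogStagePop so the warmup and the measured runs appear as distinct stages in the -log_view output.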