> On Jan 20, 2022, at 5:24 PM, Rohan Yadav <roh...@alumni.cmu.edu> wrote:
> 
> Thanks Barry, this is what I was looking for. However, it doesn't seem to be 
> working for me (the reported times are still significantly different with 
> -log_view on and off).

   I think this is because, without -log_view, your second loop overlaps the 
additional kernel launches with the GPU computations, but with -log_view it 
cannot (since -log_view forces each MatMult to end before the next one can be 
launched by the CPU). If you put the PetscLogGpuTimeBegin/End within the loop, 
then -log_view should have much less effect. But I am not sure exactly what 
will happen with them inside the loop and with -log_view, since there will be 
"extra" PetscLogGpuTimeEnd synchronization points; I don't think they will 
matter, but I cannot say for sure. Like I said, tricky.
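
   Concretely, a minimal sketch of the "timers inside the loop" version (this 
assumes the usual ierr/CHKERRQ error checking and that A, x, y, and niter 
already exist in your code):

```
PetscErrorCode ierr;
PetscLogDouble start, end;

ierr = PetscTime(&start);CHKERRQ(ierr);
for (int i = 0; i < niter; i++) {
  ierr = PetscLogGpuTimeBegin();CHKERRQ(ierr); /* GPU event timer brackets each MatMult */
  ierr = MatMult(A, x, y);CHKERRQ(ierr);
  ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);   /* with -log_view this synchronizes on the kernel */
}
ierr = PetscTime(&end);CHKERRQ(ierr);
```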




> Here is my exact timing code:
> ```
> double avgTime = 0.0;
>   {
>     PetscLogDouble start, end;
>     PetscLogGpuTimeBegin();
>     for (int i = 0; i < warmup; i++) {
>       MatMult(A, x, y);
>     }
>     PetscLogGpuTimeEnd();
>     PetscLogGpuTimeBegin();
>     PetscTime(&start);
>     for (int i = 0; i < niter; i++) {
>       MatMult(A, x, y);
>     }
>     PetscLogGpuTimeEnd();
>     PetscTime(&end);
>     auto sec = end - start;
>     avgTime = double(sec) / double(niter);
>   }
> ```
> I'm measuring the time for a group of MatMult's as you suggested (with some 
> warmup iterations).
> 
> Rohan
> 
> On Thu, Jan 20, 2022 at 1:42 PM Barry Smith <bsm...@petsc.dev> wrote:
> 
>    Some operations on the GPU are asynchronous: the CPU passes the kernel 
> launch to the GPU and then immediately returns, ready to do something else 
> before the kernel is completed (or even started). Others, like VecDot(), 
> where the result is stored in CPU memory, have to block until the kernel is 
> complete and the result is copied up to the CPU.  
> 
>   -log_view forces the calls to PetscLogGpuTimeEnd(), which has (for CUDA) 
> 
> ```
> cerr = cudaEventRecord(petsc_gputimer_end,PetscDefaultCudaStream);CHKERRCUDA(cerr);
> cerr = cudaEventSynchronize(petsc_gputimer_end);CHKERRCUDA(cerr);
> cerr = cudaEventElapsedTime(&gtime,petsc_gputimer_begin,petsc_gputimer_end);CHKERRCUDA(cerr);
> petsc_gtime += (PetscLogDouble)gtime/1000.0; /* convert milliseconds to seconds */
> ```
> 
> which essentially causes the CPU to wait until the kernel is complete; hence 
> your time with -log_view captures the full time to run the kernel.
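> 
> If you want that same blocking behavior without -log_view, a sketch (assuming 
> CUDA, that the CHKERRQ/CHKERRCUDA macros and CUDA runtime headers are 
> available, and that A, x, y, and niter exist) is to synchronize the device 
> yourself before reading the CPU timer:
> 
> ```
> PetscErrorCode ierr;
> cudaError_t    cerr;
> PetscLogDouble start, end;
> 
> ierr = PetscTime(&start);CHKERRQ(ierr);
> for (int i = 0; i < niter; i++) {
>   ierr = MatMult(A, x, y);CHKERRQ(ierr);         /* kernel launches are asynchronous */
> }
> cerr = cudaDeviceSynchronize();CHKERRCUDA(cerr); /* block until all queued kernels finish */
> ierr = PetscTime(&end);CHKERRQ(ierr);
> ```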
> 
> So timing with GPUs can be a tricky business (when do you want to block and 
> when do you not?). For your loop, you may want to use:
> 
> ```
> PetscLogGpuTimeBegin()
> start = now()
> for (int i = 0; i < 10; i++) {
>     MatMult(A, x, y);
> }
> PetscLogGpuTimeEnd()
> end = now()
> print((end - start) / 10)
> ```
> 
> 
> Now, after the loop, it will wait until all the multiplies are completely 
> done, giving a better view of the time they take. If you did
> 
> 
> ```
> start = now()
> for (int i = 0; i < 10; i++) {
>     PetscLogGpuTimeBegin()
>     MatMult(A, x, y);
>     PetscLogGpuTimeEnd()
> }
> end = now()
> print((end - start) / 10)
> ```
> 
> You would wait longer, because the CPU could not tell the GPU about the 
> second kernel launch until the first kernel is completely done. Hence there 
> would be no overlap of GPU computation and CPU kernel launches (which take a 
> long time). 
> 
> IMHO timing individual operations like a single MatMult() on GPUs has only a 
> certain level of usefulness, since you slow down the computation (by removing 
> the asynchronous overlap between the GPU and CPU) in order to get accurate 
> times. It is better to time something like a complete linear solve or 
> nonlinear solve, and not log at a finer granularity.
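> 
> For example, a sketch of timing a whole solve (assuming a KSP has already 
> been set up for A, with the usual ierr/CHKERRQ error checking):
> 
> ```
> PetscLogDouble start, end;
> 
> ierr = PetscTime(&start);CHKERRQ(ierr);
> ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr); /* time the full solve, not individual kernels */
> ierr = PetscTime(&end);CHKERRQ(ierr);
> ```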
> 
> Barry
> 
> 
> 
> 
> 
>> On Jan 20, 2022, at 4:07 PM, Rohan Yadav <roh...@alumni.cmu.edu> wrote:
>> 
>> Another small question -- I'm a little confused around timing GPU codes with 
>> PETSc. I have a code that looks like:
>> ```
>> start = now()
>> for (int i = 0; i < 10; i++) {
>>     MatMult(A, x, y);
>> }
>> end = now()
>> print((end - start) / 10)
>> ```
>> 
>> If I run this program with `-vec_type cuda -mat_type aijcusparse`, the GPUs 
>> are indeed utilized, but the recorded time is very tiny (I imagine it is just 
>> tracking the cost of launching the CUDA kernels). However, if I add `-log_view` 
>> to the command line arguments, then the resulting time printed matches what 
>> is recorded by `nvprof`. What is the correct way to benchmark PETSc with 
>> GPUs without having -log_view turned on?
>> 
>> Thanks,
>> 
>> Rohan
>> 
>> On Sat, Jan 15, 2022 at 7:37 AM Barry Smith <bsm...@petsc.dev> wrote:
>> 
>>   Oh yes, you are correct for this operation, since handling different 
>> nonzero patterns is not trivial to implement well for the GPU.
>> 
>>> On Jan 15, 2022, at 1:17 AM, Rohan Yadav <roh...@alumni.cmu.edu> wrote:
>>> 
>>> Scanning the source code for mpiseqaijcusparse confirms my thoughts -- when 
>>> used with DIFFERENT_NONZERO_PATTERN, it falls back to calling 
>>> MatAXPY_SeqAIJ, copying the data back over to the host.
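>>> 
>>> (For what it's worth, a sketch of the case that should be able to stay on 
>>> the device; this assumes Y and X really do share an identical nonzero 
>>> pattern, with the usual ierr/CHKERRQ checking:)
>>> 
>>> ```
>>> /* only valid when Y and X truly have the same nonzero structure */
>>> ierr = MatAXPY(Y, alpha, X, SAME_NONZERO_PATTERN);CHKERRQ(ierr);
>>> ```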
>>> 
>>> Rohan
>>> 
>>> On Fri, Jan 14, 2022 at 10:16 PM Rohan Yadav <roh...@alumni.cmu.edu> wrote:
>>> 
>>> 
>>> ---------- Forwarded message ---------
>>> From: Rohan Yadav <roh...@alumni.cmu.edu>
>>> Date: Fri, Jan 14, 2022 at 10:03 PM
>>> Subject: Re: [petsc-dev] Using PETSC with GPUs
>>> To: Barry Smith <bsm...@petsc.dev>
>>> 
>>> 
>>> Ok, I'll try looking with greps like that and see what I find.
>>> 
>>> >  My guess why your code is not using the seqaijcusparse is that you are 
>>> > not setting the type before you call MatLoad() hence it loads with 
>>> > SeqAIJ. -mat_type does not magically change a type once a matrix has a 
>>> > set type. I agree our documentation on how to make objects be GPU objects 
>>> > is horrible now.
>>> 
>>> I printed out my matrices with the PetscViewer objects and can confirm that 
>>> the type is seqaijcusparse. Perhaps for the way I'm using it 
>>> (DIFFERENT_NONZERO_PATTERN) the kernel is unsupported? I'm not sure how to 
>>> get any more diagnostic info about why the cuda kernel isn't called...
>>> 
>>> Rohan
>>> 
>>> On Fri, Jan 14, 2022 at 9:46 PM Barry Smith <bsm...@petsc.dev> wrote:
>>> 
>>>   This changes rapidly and depends on whether the backend is CUDA, HIP, 
>>> Sycl, or Kokkos. The only way to find out definitively is with, for example, 
>>> 
>>> git grep MatMult_ | egrep -i "(cusparse|cublas|cuda)"
>>> 
>>> 
>>>   Because of our unfortunate earlier naming choices, you need to kind of 
>>> know what to grep for; for CUDA it may be cuSparse or cuBLAS.
>>> 
>>>   Branches that are not yet merged may also have some operations that are 
>>> still being developed.
>>> 
>>>   My guess why your code is not using the seqaijcusparse is that you are 
>>> not setting the type before you call MatLoad() hence it loads with SeqAIJ. 
>>> -mat_type does not magically change a type once a matrix has a set type. I 
>>> agree our documentation on how to make objects be GPU objects is horrible 
>>> now.
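>>> 
>>> Concretely, a sketch of setting the type before loading (here "file" just 
>>> stands in for your matrix path, and the usual ierr/CHKERRQ checking is 
>>> assumed):
>>> 
>>> ```
>>> Mat         A;
>>> PetscViewer viewer;
>>> 
>>> ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
>>> ierr = MatSetType(A, MATAIJCUSPARSE);CHKERRQ(ierr); /* set the GPU type BEFORE MatLoad() */
>>> ierr = MatSetFromOptions(A);CHKERRQ(ierr);          /* -mat_type can still override */
>>> ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD, file, FILE_MODE_READ, &viewer);CHKERRQ(ierr);
>>> ierr = MatLoad(A, viewer);CHKERRQ(ierr);
>>> ierr = PetscViewerDestroy(&viewer);CHKERRQ(ierr);
>>> ```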
>>> 
>>>   Barry
>>> 
>>> 
>>>> On Jan 15, 2022, at 12:31 AM, Rohan Yadav <roh...@alumni.cmu.edu> wrote:
>>>> 
>>>> I was wondering if there is a definitive list of which operations are and 
>>>> aren't supported for distributed GPU execution. For some operations, like 
>>>> `MatMult`, it is clear from the documentation that MPIAIJCUSPARSE implements 
>>>> MatMult, but for other operations, such as MatMatMult, it is unclear. 
>>>> Another scenario is the MatAXPY kernel, which supposedly has a 
>>>> SeqAIJCUSPARSE implementation, which I take to mean that it can only execute 
>>>> on a single GPU. However, even if I pass -mat_type seqaijcusparse, the 
>>>> kernel doesn't seem to utilize the GPU.
>>>> 
>>>> Rohan
>>>> 
>>>> On Fri, Jan 14, 2022 at 4:05 PM Barry Smith <bsm...@petsc.dev> wrote:
>>>> 
>>>>   Just use 1 MPI rank. 
>>>> 
>>>>   
>>>> ------------------------------------------------------------------------------------------------------------------------
>>>> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
>>>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
>>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>> 
>>>> --- Event Stage 0: Main Stage
>>>> 
>>>> BuildTwoSided          1 1.0 1.8650e-01 3467.8 0.00e+00 0.0 2.0e+00 4.0e+00 1.0e+00  0  0  3  0  2   0  0  3  0  4     0       0      0 0.00e+00    0 0.00e+00  0
>>>> MatMult               30 1.0 6.6642e+01 1.0 1.16e+10 1.0 6.4e+01 6.4e+08 1.0e+00 65 100 91 93  2  65 100 91 93  4   346       0      0 0.00e+00   31 2.65e+04  0
>>>> 
>>>> From this it is clear the matrix never ended up on the GPU, but the vector 
>>>> did. For each multiply, it is copying the vector from the GPU to the CPU 
>>>> and then doing the MatMult on the CPU. If the MatMult were done on the GPU, 
>>>> the final number in the row (GPU %F) would be 100, indicating all the flops 
>>>> were done on the GPU, and the fifth-from-the-end value of 0 would be some 
>>>> large number: the flop rate on the GPU.
>>>> 
>>>> 
>>>> 
>>>>> On Jan 14, 2022, at 4:59 PM, Rohan Yadav <roh...@alumni.cmu.edu> wrote:
>>>>> 
>>>>> A log_view is attached at the end of the mail.
>>>>> 
>>>>> I am running on a large problem size (639 million nonzeros).
>>>>> 
>>>>> > * I assume you are assembling the matrix on the CPU. The copy of data 
>>>>> > to the GPU takes time and you really should be creating the matrix on 
>>>>> > the GPU
>>>>> 
>>>>> How do I do this? I'm loading the matrix in from a file, but I'm running 
>>>>> the computation several times (and with a warmup), so I would expect that 
>>>>> the data is copied onto the GPU the first time. My (CPU) code to do this 
>>>>> is here: 
>>>>> https://github.com/rohany/taco/blob/5c0a4f4419ba392838590ce24e0043f632409e7b/petsc/benchmark.cpp#L68.
>>>>> 
>>>>> Log view:
>>>>> 
>>>>> ---------------------------------------------- PETSc Performance Summary: ----------------------------------------------
>>>>> 
>>>>> ./bin/benchmark on a  named lassen75 with 2 processors, by yadav2 Fri Jan 14 13:54:09 2022
>>>>> Using Petsc Release Version 3.16.3, unknown
>>>>> 
>>>>>                          Max       Max/Min     Avg       Total
>>>>> Time (sec):           1.026e+02     1.000   1.026e+02
>>>>> Objects:              1.200e+01     1.000   1.200e+01
>>>>> Flop:                 1.156e+10     1.009   1.151e+10  2.303e+10
>>>>> Flop/sec:             1.127e+08     1.009   1.122e+08  2.245e+08
>>>>> MPI Messages:         3.500e+01     1.000   3.500e+01  7.000e+01
>>>>> MPI Message Lengths:  2.210e+10     1.000   6.313e+08  4.419e+10
>>>>> MPI Reductions:       4.100e+01     1.000
>>>>> 
>>>>> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>>>>>                             e.g., VecAXPY() for real vectors of length N --> 2N flop
>>>>>                             and VecAXPY() for complex vectors of length N --> 8N flop
>>>>> 
>>>>> Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
>>>>>                         Avg     %Total     Avg     %Total    Count   %Total     Avg         %Total    Count   %Total
>>>>>  0:      Main Stage: 1.0257e+02 100.0%  2.3025e+10 100.0%  7.000e+01 100.0%  6.313e+08      100.0%  2.300e+01  56.1%
>>>>> 
>>>>> ------------------------------------------------------------------------------------------------------------------------
>>>>> See the 'Profiling' chapter of the users' manual for details on interpreting output.
>>>>> Phase summary info:
>>>>>    Count: number of times phase was executed
>>>>>    Time and Flop: Max - maximum over all processors
>>>>>                   Ratio - ratio of maximum to minimum over all processors
>>>>>    Mess: number of messages sent
>>>>>    AvgLen: average message length (bytes)
>>>>>    Reduct: number of global reductions
>>>>>    Global: entire computation
>>>>>    Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>>>>>       %T - percent time in this phase         %F - percent flop in this phase
>>>>>       %M - percent messages in this phase     %L - percent message lengths in this phase
>>>>>       %R - percent reductions in this phase
>>>>>    Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
>>>>>    GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU time over all processors)
>>>>>    CpuToGpu Count: total number of CPU to GPU copies per processor
>>>>>    CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per processor)
>>>>>    GpuToCpu Count: total number of GPU to CPU copies per processor
>>>>>    GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per processor)
>>>>>    GPU %F: percent flops on GPU in this event
>>>>> ------------------------------------------------------------------------------------------------------------------------
>>>>> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
>>>>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
>>>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>>> 
>>>>> --- Event Stage 0: Main Stage
>>>>> 
>>>>> BuildTwoSided          1 1.0 1.8650e-01 3467.8 0.00e+00 0.0 2.0e+00 4.0e+00 1.0e+00  0  0  3  0  2   0  0  3  0  4     0       0      0 0.00e+00    0 0.00e+00  0
>>>>> MatMult               30 1.0 6.6642e+01 1.0 1.16e+10 1.0 6.4e+01 6.4e+08 1.0e+00 65 100 91 93  2  65 100 91 93  4   346       0      0 0.00e+00   31 2.65e+04  0
>>>>> MatAssemblyBegin       1 1.0 3.1100e-07 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>>> MatAssemblyEnd         1 1.0 1.9798e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 4.0e+00 19  0  0  0 10  19  0  0  0 17     0       0      0 0.00e+00    0 0.00e+00  0
>>>>> MatLoad                1 1.0 3.5519e+01 1.0 0.00e+00 0.0 6.0e+00 5.4e+08 1.6e+01 35  0  9  7 39  35  0  9  7 70     0       0      0 0.00e+00    0 0.00e+00  0
>>>>> VecSet                 5 1.0 5.8959e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>>> VecScatterBegin       30 1.0 5.4085e+00 1.0 0.00e+00 0.0 6.4e+01 6.4e+08 1.0e+00  5  0 91 93  2   5  0 91 93  4     0       0      0 0.00e+00    0 0.00e+00  0
>>>>> VecScatterEnd         30 1.0 9.2544e+00 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  6  0  0  0  0   6  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>>> VecCUDACopyFrom       31 1.0 4.0174e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00   31 2.65e+04  0
>>>>> SFSetGraph             1 1.0 4.4912e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>>> SFSetUp                1 1.0 5.2595e+00 1.0 0.00e+00 0.0 4.0e+00 1.7e+08 1.0e+00  5  0  6  2  2   5  0  6  2  4     0       0      0 0.00e+00    0 0.00e+00  0
>>>>> SFPack                30 1.0 3.4021e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>>> SFUnpack              30 1.0 1.9222e-05 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>>> 
>>>>> Memory usage is given in bytes:
>>>>> 
>>>>> Object Type          Creations   Destructions     Memory  Descendants' Mem.
>>>>> Reports information only for process 0.
>>>>> 
>>>>> --- Event Stage 0: Main Stage
>>>>> 
>>>>>               Matrix     3              0            0     0.
>>>>>               Viewer     2              0            0     0.
>>>>>               Vector     4              1         1792     0.
>>>>>            Index Set     2              2    335250404     0.
>>>>>    Star Forest Graph     1              0            0     0.
>>>>> ========================================================================================================================
>>>>> Average time to get PetscTime(): 3.77e-08
>>>>> Average time for MPI_Barrier(): 8.754e-07
>>>>> Average time for zero size MPI_Send(): 2.6755e-06
>>>>> #PETSc Option Table entries:
>>>>> -log_view
>>>>> -mat_type aijcusparse
>>>>> -matrix /p/gpfs1/yadav2/tensors//petsc/kmer_V1r.petsc
>>>>> -n 20
>>>>> -vec_type cuda
>>>>> -warmup 10
>>>>> #End of PETSc Option Table entries
>>>>> Compiled without FORTRAN kernels
>>>>> Compiled with full precision matrices (default)
>>>>> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 sizeof(PetscScalar) 8 sizeof(PetscInt) 4
>>>>> Configure options: --download-c2html=0 --download-hwloc=0 
>>>>> --download-sowing=0 --prefix=./petsc-install/ --with-64-bit-indices=0 
>>>>> --with-blaslapack-lib="/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib/liblapack.so
>>>>>  /usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib/libblas.so" 
>>>>> --with-cc=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc
>>>>>  --with-clanguage=C --with-cxx-dialect=C++17 
>>>>> --with-cxx=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpig++
>>>>>  --with-cuda=1 --with-debugging=0 
>>>>> --with-fc=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran
>>>>>  --with-fftw=0 
>>>>> --with-hdf5-dir=/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4
>>>>>  --with-hdf5=1 --with-mumps=0 --with-precision=double --with-scalapack=0 
>>>>> --with-scalar-type=real --with-shared-libraries=1 --with-ssl=0 
>>>>> --with-suitesparse=0 --with-trilinos=0 --with-valgrind=0 --with-x=0 
>>>>> --with-zlib-include=/usr/include --with-zlib-lib=/usr/lib64/libz.so 
>>>>> --with-zlib=1 CFLAGS="-g -DNoChange" COPTFLAGS="-O3" CXXFLAGS="-O3" 
>>>>> CXXOPTFLAGS="-O3" FFLAGS=-g CUDAFLAGS=-std=c++17 FOPTFLAGS= 
>>>>> PETSC_ARCH=arch-linux-c-opt
>>>>> -----------------------------------------
>>>>> Libraries compiled on 2022-01-14 20:56:04 on lassen99
>>>>> Machine characteristics: 
>>>>> Linux-4.14.0-115.21.2.1chaos.ch6a.ppc64le-ppc64le-with-redhat-7.6-Maipo
>>>>> Using PETSc directory: /g/g15/yadav2/taco/petsc/petsc/petsc-install
>>>>> Using PETSc arch:
>>>>> -----------------------------------------
>>>>> 
>>>>> Using C compiler: 
>>>>> /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc
>>>>>  -g -DNoChange -fPIC "-O3"
>>>>> Using Fortran compiler: 
>>>>> /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran
>>>>>  -g -fPIC
>>>>> -----------------------------------------
>>>>> 
>>>>> Using include paths: 
>>>>> -I/g/g15/yadav2/taco/petsc/petsc/petsc-install/include 
>>>>> -I/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/include
>>>>>  -I/usr/include -I/usr/tce/packages/cuda/cuda-11.1.0/include
>>>>> -----------------------------------------
>>>>> 
>>>>> Using C linker: 
>>>>> /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc
>>>>> Using Fortran linker: 
>>>>> /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran
>>>>> Using libraries: 
>>>>> -Wl,-rpath,/g/g15/yadav2/taco/petsc/petsc/petsc-install/lib 
>>>>> -L/g/g15/yadav2/taco/petsc/petsc/petsc-install/lib -lpetsc 
>>>>> -Wl,-rpath,/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib 
>>>>> -L/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib 
>>>>> -Wl,-rpath,/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/lib
>>>>>  
>>>>> -L/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/lib
>>>>>  -Wl,-rpath,/usr/tce/packages/cuda/cuda-11.1.0/lib64 
>>>>> -L/usr/tce/packages/cuda/cuda-11.1.0/lib64 
>>>>> -Wl,-rpath,/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib
>>>>>  -L/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib 
>>>>> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc/ppc64le-redhat-linux/8
>>>>>  -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc/ppc64le-redhat-linux/8 
>>>>> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc 
>>>>> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc 
>>>>> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib64 
>>>>> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib64 
>>>>> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib 
>>>>> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib -llapack -lblas -lhdf5_hl 
>>>>> -lhdf5 -lm /usr/lib64/libz.so -lcuda -lcudart -lcufft -lcublas -lcusparse 
>>>>> -lcusolver -lcurand -lstdc++ -ldl -lmpiprofilesupport -lmpi_ibm_usempi 
>>>>> -lmpi_ibm_mpifh -lmpi_ibm -lgfortran -lm -lgfortran -lm -lgcc_s 
>>>>> -lquadmath -lpthread -lquadmath -lstdc++ -ldl
>>>>> -----------------------------------------
>>>>> 
>>>>> On Fri, Jan 14, 2022 at 1:43 PM Mark Adams <mfad...@lbl.gov> wrote:
>>>>> There are a few things:
>>>>> * GPU have higher latencies and so you basically need a large enough 
>>>>> problem to get GPU speedup
>>>>> * I assume you are assembling the matrix on the CPU. The copy of data to 
>>>>> the GPU takes time and you really should be creating the matrix on the GPU
>>>>> * I agree with Barry: roughly 1M / GPU is around where you start seeing a 
>>>>> win, but this depends on a lot of things.
>>>>> * There are startup costs, like the CPU-GPU copy. It is best to run one 
>>>>> mat-vec, or whatever, push a new stage, and then run the benchmark (see the 
>>>>> sketch after this list). The timing for this new stage will be separate in 
>>>>> the log view data. Look at that.
>>>>>  - You can fake this by running your benchmark many times to amortize any 
>>>>> setup costs.
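>>>>> 
>>>>> A minimal sketch of the stage approach (names are illustrative; assumes 
>>>>> ierr/CHKERRQ error checking and that A, x, y, and niter exist):
>>>>> 
>>>>> ```
>>>>> PetscLogStage stage;
>>>>> 
>>>>> ierr = MatMult(A, x, y);CHKERRQ(ierr);           /* warm-up triggers the CpuToGpu copies */
>>>>> ierr = PetscLogStageRegister("Benchmark", &stage);CHKERRQ(ierr);
>>>>> ierr = PetscLogStagePush(stage);CHKERRQ(ierr);
>>>>> for (int i = 0; i < niter; i++) {
>>>>>   ierr = MatMult(A, x, y);CHKERRQ(ierr);
>>>>> }
>>>>> ierr = PetscLogStagePop();CHKERRQ(ierr);         /* -log_view reports this stage separately */
>>>>> ```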
>>>>> 
>>>>> On Fri, Jan 14, 2022 at 4:27 PM Rohan Yadav <roh...@alumni.cmu.edu> wrote:
>>>>> Hi,
>>>>> 
>>>>> I'm looking to use PETSc with GPUs to do some linear algebra operations, 
>>>>> like SpMV, SpMM, etc. Building PETSc with `--with-cuda=1` and running with 
>>>>> `-mat_type aijcusparse -vec_type cuda` gives me a large slowdown compared to 
>>>>> the same code running on the CPU. This is not entirely unexpected, as things 
>>>>> like data transfer costs across PCIe might erroneously be included in 
>>>>> my timing. Are there some examples of benchmarking GPU computations with 
>>>>> PETSc, or just the proper way to write code in PETSc that will work for 
>>>>> CPUs and GPUs?
>>>>> 
>>>>> Rohan
>>>> 
>>> 
>> 
> 
