You can run it in the debugger (-start_in_debugger noxterm) and put a break 
point in MatMult(). Then print the matrix in the debugger (for example p *mat) 
it will dump a huge amount of stuff including the function pointer table, in 
that there should be a MatMult_seqaijcusparse function pointer. If it is 
MatMult_SeqAIJ then that is why it is not running on the GPU. Look at the other 
functions in the function table, are any of the cuSPARSE functions. 

> DIFFERENT_NONZERO_PATTERN should not affect what functions it uses for the 
> operations


  Barry

This is why running with just 1 MPI rank is so much easier.

> On Jan 15, 2022, at 1:03 AM, Rohan Yadav <roh...@alumni.cmu.edu> wrote:
> 
> Ok, I'll try looking with greps like and see what I find.
> 
> >  My guess why your code is not using the seqaijcusparse is that you are not 
> > setting the type before you call MatLoad() hence it loads with SeqAIJ. 
> > -mat_type does not magically change a type once a matrix has a set type. I 
> > agree our documentation on how to make objects be GPU objects is horrible 
> > now.
> 
> I printed out my matrices with the PetscViewer objects and can confirm that 
> the type is seqaijcusparse. Perhaps for the way I'm using it 
> (DIFFERENT_NONZERO_PATTERN) the kernel is unsupported? I'm not sure how to 
> get any more diagnostic info about why the cuda kernel isn't called...
> 
> Rohan
> 
> On Fri, Jan 14, 2022 at 9:46 PM Barry Smith <bsm...@petsc.dev 
> <mailto:bsm...@petsc.dev>> wrote:
> 
>   This changes rapidly and depends on if the backend is CUDA, HIP, Sycl, or 
> Kokkos. The only way to find out definitively is with, for example, 
> 
> git grep MatMult_ | egrep -i "(cusparse|cublas|cuda)"
> 
> 
>   Because of our, unfortunately, earlier naming choices you need to kind of 
> know what to grep for, for CUDA it may be cuSparse or cuBLAS
> 
>   Not yet merged branches may also have some operations that are still being 
> developed.
> 
>   My guess why your code is not using the seqaijcusparse is that you are not 
> setting the type before you call MatLoad() hence it loads with SeqAIJ. 
> -mat_type does not magically change a type once a matrix has a set type. I 
> agree our documentation on how to make objects be GPU objects is horrible now.
> 
>   Barry
> 
> 
>> On Jan 15, 2022, at 12:31 AM, Rohan Yadav <roh...@alumni.cmu.edu 
>> <mailto:roh...@alumni.cmu.edu>> wrote:
>> 
>> I was wondering if there is a definitive list for what operations are and 
>> aren't supported for distributed GPU execution. For some operations, like 
>> `MatMult`, it is clear that MPIAIJCUSPARSE implements MatMult from the 
>> documentation, but other operations it is unclear, such as MatMatMult. 
>> Another scenario is the MatAXPY kernel, which supposedly has a 
>> SeqAIJCUSPARSE implementation, which I take means that it can only execute 
>> on a single GPU. However, even if I pass -mat_type seqaijcusparse to the 
>> kernel it doesn't seem to utilize the GPU.
>> 
>> Rohan
>> 
>> On Fri, Jan 14, 2022 at 4:05 PM Barry Smith <bsm...@petsc.dev 
>> <mailto:bsm...@petsc.dev>> wrote:
>> 
>>   Just use 1 MPI rank. 
>> 
>>   
>> ------------------------------------------------------------------------------------------------------------------------
>> Event                Count      Time (sec)     Flop                          
>>     --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu 
>> - GPU
>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  
>> Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count  
>>  Size  %F
>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>> 
>> --- Event Stage 0: Main Stage
>> 
>> BuildTwoSided          1 1.0 1.8650e-013467.8 0.00e+00 0.0 2.0e+00 4.0e+00 
>> 1.0e+00  0  0  3  0  2   0  0  3  0  4     0       0      0 0.00e+00    0 
>> 0.00e+00  0
>> MatMult               30 1.0 6.6642e+01 1.0 1.16e+10 1.0 6.4e+01 6.4e+08 
>> 1.0e+00 65100 91 93  2  65100 91 93  4   346       0      0 0.00e+00   31 
>> 2.65e+04  0
>> 
>> From this it is clear the matrix never ended up on the GPU, but the vector 
>> did. For each multiply, it is copying the vector from the GPU to the CPU and 
>> then doing the MatMult on the CPU. If the MatMult was done on the GPU the 
>> file number in the row would be 100% indicating all the flops were done on 
>> the GPU and the fifth from the end value of 0 would be some large number, 
>> being the flop rate on the GPU.
>> 
>> 
>> 
>>> On Jan 14, 2022, at 4:59 PM, Rohan Yadav <roh...@alumni.cmu.edu 
>>> <mailto:roh...@alumni.cmu.edu>> wrote:
>>> 
>>> A log_view is attached at the end of the mail.
>>> 
>>> I am running on a large problem size (639 million nonzeros).
>>> 
>>> > * I assume you are assembling the matrix on the CPU. The copy of data to 
>>> > the GPU takes time and you really should be creating the matrix on the GPU
>>> 
>>> How do I do this? I'm loading the matrix in from a file, but I'm running 
>>> the computation several times (and with a warmup), so I would expect that 
>>> the data is copied onto the GPU the first time. My (cpu) code to do this is 
>>> here: 
>>> https://github.com/rohany/taco/blob/5c0a4f4419ba392838590ce24e0043f632409e7b/petsc/benchmark.cpp#L68
>>>  
>>> <https://github.com/rohany/taco/blob/5c0a4f4419ba392838590ce24e0043f632409e7b/petsc/benchmark.cpp#L68>.
>>> 
>>> Log view:
>>> 
>>> ---------------------------------------------- PETSc Performance Summary: 
>>> ----------------------------------------------
>>> 
>>> ./bin/benchmark on a  named lassen75 with 2 processors, by yadav2 Fri Jan 
>>> 14 13:54:09 2022
>>> Using Petsc Release Version 3.16.3, unknown
>>> 
>>>                          Max       Max/Min     Avg       Total
>>> Time (sec):           1.026e+02     1.000   1.026e+02
>>> Objects:              1.200e+01     1.000   1.200e+01
>>> Flop:                 1.156e+10     1.009   1.151e+10  2.303e+10
>>> Flop/sec:             1.127e+08     1.009   1.122e+08  2.245e+08
>>> MPI Messages:         3.500e+01     1.000   3.500e+01  7.000e+01
>>> MPI Message Lengths:  2.210e+10     1.000   6.313e+08  4.419e+10
>>> MPI Reductions:       4.100e+01     1.000
>>> 
>>> Flop counting convention: 1 flop = 1 real number operation of type 
>>> (multiply/divide/add/subtract)
>>>                             e.g., VecAXPY() for real vectors of length N 
>>> --> 2N flop
>>>                             and VecAXPY() for complex vectors of length N 
>>> --> 8N flop
>>> 
>>> Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages --- 
>>>  -- Message Lengths --  -- Reductions --
>>>                         Avg     %Total     Avg     %Total    Count   %Total 
>>>     Avg         %Total    Count   %Total
>>>  0:      Main Stage: 1.0257e+02 100.0%  2.3025e+10 100.0%  7.000e+01 100.0% 
>>>  6.313e+08      100.0%  2.300e+01  56.1%
>>> 
>>> ------------------------------------------------------------------------------------------------------------------------
>>> See the 'Profiling' chapter of the users' manual for details on 
>>> interpreting output.
>>> Phase summary info:
>>>    Count: number of times phase was executed
>>>    Time and Flop: Max - maximum over all processors
>>>                   Ratio - ratio of maximum to minimum over all processors
>>>    Mess: number of messages sent
>>>    AvgLen: average message length (bytes)
>>>    Reduct: number of global reductions
>>>    Global: entire computation
>>>    Stage: stages of a computation. Set stages with PetscLogStagePush() and 
>>> PetscLogStagePop().
>>>       %T - percent time in this phase         %F - percent flop in this 
>>> phase
>>>       %M - percent messages in this phase     %L - percent message lengths 
>>> in this phase
>>>       %R - percent reductions in this phase
>>>    Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over 
>>> all processors)
>>>    GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU 
>>> time over all processors)
>>>    CpuToGpu Count: total number of CPU to GPU copies per processor
>>>    CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per 
>>> processor)
>>>    GpuToCpu Count: total number of GPU to CPU copies per processor
>>>    GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per 
>>> processor)
>>>    GPU %F: percent flops on GPU in this event
>>> ------------------------------------------------------------------------------------------------------------------------
>>> Event                Count      Time (sec)     Flop                         
>>>      --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - 
>>> GpuToCpu - GPU
>>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  
>>> Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count 
>>>   Size  %F
>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>> 
>>> --- Event Stage 0: Main Stage
>>> 
>>> BuildTwoSided          1 1.0 1.8650e-013467.8 0.00e+00 0.0 2.0e+00 4.0e+00 
>>> 1.0e+00  0  0  3  0  2   0  0  3  0  4     0       0      0 0.00e+00    0 
>>> 0.00e+00  0
>>> MatMult               30 1.0 6.6642e+01 1.0 1.16e+10 1.0 6.4e+01 6.4e+08 
>>> 1.0e+00 65100 91 93  2  65100 91 93  4   346       0      0 0.00e+00   31 
>>> 2.65e+04  0
>>> MatAssemblyBegin       1 1.0 3.1100e-07 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 
>>> 0.00e+00  0
>>> MatAssemblyEnd         1 1.0 1.9798e+01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 4.0e+00 19  0  0  0 10  19  0  0  0 17     0       0      0 0.00e+00    0 
>>> 0.00e+00  0
>>> MatLoad                1 1.0 3.5519e+01 1.0 0.00e+00 0.0 6.0e+00 5.4e+08 
>>> 1.6e+01 35  0  9  7 39  35  0  9  7 70     0       0      0 0.00e+00    0 
>>> 0.00e+00  0
>>> VecSet                 5 1.0 5.8959e-02 1.1 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 
>>> 0.00e+00  0
>>> VecScatterBegin       30 1.0 5.4085e+00 1.0 0.00e+00 0.0 6.4e+01 6.4e+08 
>>> 1.0e+00  5  0 91 93  2   5  0 91 93  4     0       0      0 0.00e+00    0 
>>> 0.00e+00  0
>>> VecScatterEnd         30 1.0 9.2544e+00 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  6  0  0  0  0   6  0  0  0  0     0       0      0 0.00e+00    0 
>>> 0.00e+00  0
>>> VecCUDACopyFrom       31 1.0 4.0174e-01 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00   31 
>>> 2.65e+04  0
>>> SFSetGraph             1 1.0 4.4912e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 
>>> 0.00e+00  0
>>> SFSetUp                1 1.0 5.2595e+00 1.0 0.00e+00 0.0 4.0e+00 1.7e+08 
>>> 1.0e+00  5  0  6  2  2   5  0  6  2  4     0       0      0 0.00e+00    0 
>>> 0.00e+00  0
>>> SFPack                30 1.0 3.4021e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 
>>> 0.00e+00  0
>>> SFUnpack              30 1.0 1.9222e-05 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 
>>> 0.00e+00  0
>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>> 
>>> Memory usage is given in bytes:
>>> 
>>> Object Type          Creations   Destructions     Memory  Descendants' Mem.
>>> Reports information only for process 0.
>>> 
>>> --- Event Stage 0: Main Stage
>>> 
>>>               Matrix     3              0            0     0.
>>>               Viewer     2              0            0     0.
>>>               Vector     4              1         1792     0.
>>>            Index Set     2              2    335250404     0.
>>>    Star Forest Graph     1              0            0     0.
>>> ========================================================================================================================
>>> Average time to get PetscTime(): 3.77e-08
>>> Average time for MPI_Barrier(): 8.754e-07
>>> Average time for zero size MPI_Send(): 2.6755e-06
>>> #PETSc Option Table entries:
>>> -log_view
>>> -mat_type aijcusparse
>>> -matrix /p/gpfs1/yadav2/tensors//petsc/kmer_V1r.petsc
>>> -n 20
>>> -vec_type cuda
>>> -warmup 10
>>> #End of PETSc Option Table entries
>>> Compiled without FORTRAN kernels
>>> Compiled with full precision matrices (default)
>>> sizeof(short) 2 sizeof(int) 4 sizeof(long) 8 sizeof(void*) 8 
>>> sizeof(PetscScalar) 8 sizeof(PetscInt) 4
>>> Configure options: --download-c2html=0 --download-hwloc=0 
>>> --download-sowing=0 --prefix=./petsc-install/ --with-64-bit-indices=0 
>>> --with-blaslapack-lib="/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib/liblapack.so
>>>  /usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib/libblas.so" 
>>> --with-cc=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc
>>>  --with-clanguage=C --with-cxx-dialect=C++17 
>>> --with-cxx=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpig++
>>>  --with-cuda=1 --with-debugging=0 
>>> --with-fc=/usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran
>>>  --with-fftw=0 
>>> --with-hdf5-dir=/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4
>>>  --with-hdf5=1 --with-mumps=0 --with-precision=double --with-scalapack=0 
>>> --with-scalar-type=real --with-shared-libraries=1 --with-ssl=0 
>>> --with-suitesparse=0 --with-trilinos=0 --with-valgrind=0 --with-x=0 
>>> --with-zlib-include=/usr/include --with-zlib-lib=/usr/lib64/libz.so 
>>> --with-zlib=1 CFLAGS="-g -DNoChange" COPTFLAGS="-O3" CXXFLAGS="-O3" 
>>> CXXOPTFLAGS="-O3" FFLAGS=-g CUDAFLAGS=-std=c++17 FOPTFLAGS= 
>>> PETSC_ARCH=arch-linux-c-opt
>>> -----------------------------------------
>>> Libraries compiled on 2022-01-14 20:56:04 on lassen99
>>> Machine characteristics: 
>>> Linux-4.14.0-115.21.2.1chaos.ch6a.ppc64le-ppc64le-with-redhat-7.6-Maipo
>>> Using PETSc directory: /g/g15/yadav2/taco/petsc/petsc/petsc-install
>>> Using PETSc arch:
>>> -----------------------------------------
>>> 
>>> Using C compiler: 
>>> /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc
>>>  -g -DNoChange -fPIC "-O3"
>>> Using Fortran compiler: 
>>> /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran
>>>  -g -fPIC
>>> -----------------------------------------
>>> 
>>> Using include paths: -I/g/g15/yadav2/taco/petsc/petsc/petsc-install/include 
>>> -I/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/include
>>>  -I/usr/include -I/usr/tce/packages/cuda/cuda-11.1.0/include
>>> -----------------------------------------
>>> 
>>> Using C linker: 
>>> /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigcc
>>> Using Fortran linker: 
>>> /usr/tce/packages/spectrum-mpi/spectrum-mpi-rolling-release-gcc-8.3.1/bin/mpigfortran
>>> Using libraries: 
>>> -Wl,-rpath,/g/g15/yadav2/taco/petsc/petsc/petsc-install/lib 
>>> -L/g/g15/yadav2/taco/petsc/petsc/petsc-install/lib -lpetsc 
>>> -Wl,-rpath,/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib 
>>> -L/usr/tcetmp/packages/lapack/lapack-3.9.0-gcc-7.3.1/lib 
>>> -Wl,-rpath,/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/lib
>>>  
>>> -L/usr/tcetmp/packages/petsc/build/3.13.0/spack/opt/spack/linux-rhel7-power9le/xl_r-16.1/hdf5-1.10.6-e7e7urb5k7va3ib7j4uro56grvzmcmd4/lib
>>>  -Wl,-rpath,/usr/tce/packages/cuda/cuda-11.1.0/lib64 
>>> -L/usr/tce/packages/cuda/cuda-11.1.0/lib64 
>>> -Wl,-rpath,/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib
>>>  -L/usr/tce/packages/spectrum-mpi/ibm/spectrum-mpi-rolling-release/lib 
>>> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc/ppc64le-redhat-linux/8
>>>  -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc/ppc64le-redhat-linux/8 
>>> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc 
>>> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib/gcc 
>>> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib64 
>>> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib64 
>>> -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib 
>>> -L/usr/tce/packages/gcc/gcc-8.3.1/rh/usr/lib -llapack -lblas -lhdf5_hl 
>>> -lhdf5 -lm /usr/lib64/libz.so -lcuda -lcudart -lcufft -lcublas -lcusparse 
>>> -lcusolver -lcurand -lstdc++ -ldl -lmpiprofilesupport -lmpi_ibm_usempi 
>>> -lmpi_ibm_mpifh -lmpi_ibm -lgfortran -lm -lgfortran -lm -lgcc_s -lquadmath 
>>> -lpthread -lquadmath -lstdc++ -ldl
>>> -----------------------------------------
>>> 
>>> On Fri, Jan 14, 2022 at 1:43 PM Mark Adams <mfad...@lbl.gov 
>>> <mailto:mfad...@lbl.gov>> wrote:
>>> There are a few things:
>>> * GPU have higher latencies and so you basically need a large enough 
>>> problem to get GPU speedup
>>> * I assume you are assembling the matrix on the CPU. The copy of data to 
>>> the GPU takes time and you really should be creating the matrix on the GPU
>>> * I agree with Barry, Roughly 1M / GPU is around where you start seeing a 
>>> win but this depends on a lot of things.
>>> * There are startup costs, like the CPU-GPU copy. It is best to run one 
>>> mat-vec, or whatever, push a new stage and then run the benchmark. The 
>>> timing for this new stage will be separate in the log view data. Look at 
>>> that.
>>>  - You can fake this by running your benchmark many times to amortize any 
>>> setup costs.
>>> 
>>> On Fri, Jan 14, 2022 at 4:27 PM Rohan Yadav <roh...@alumni.cmu.edu 
>>> <mailto:roh...@alumni.cmu.edu>> wrote:
>>> Hi,
>>> 
>>> I'm looking to use PETSc with GPUs to do some linear algebra operations, 
>>> like SpMV, SPMM etc. Building PETSc with `--with-cuda=1` and running with 
>>> `-mat_type aijcusparse -vec_type cuda` gives me a large slowdown from the 
>>> same code running on the CPU. This is not entirely unexpected, as things 
>>> like data transfer costs across the PCIE might erroneously be included in 
>>> my timing. Are there some examples of benchmarking GPU computations with 
>>> PETSc, or just the proper way to write code in PETSc that will work for 
>>> CPUs and GPUs?
>>> 
>>> Rohan
>> 
> 

Reply via email to