Hei,

the usual size of those matrices is (cumulative, not distributed) at least [8192 x 8192] x [8192 x 32768] complex entries. Does it still make sense to test CUDA for a speedup?
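As a rough back-of-the-envelope estimate on my side (using the 8N flop convention for complex operations from the quoted log, and assuming a single multiplication really involves matrices of that full size):

    one full-size complex product:   8 * 8192 * 8192 * 32768  ~ 1.8e+13 flop
    per logged MatMatMultNum call:   4.0706e+02 s / 4161  ~ 0.098 s,   5.02e+13 flop / 4161  ~ 1.2e+10 flop

so each individual product in the run below corresponds to roughly a tenth of a second of dense complex work on the CPU.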
Thank you,
regards,
Roland

On 16.02.21 at 14:14, Stefano Zampini wrote:
>
> On Tue, 16 Feb 2021 at 11:43, Roland Richter <[email protected]> wrote:
>
>> Hei,
>>
>> after profiling my program using -log_view, I got the following output (all matrices are dense):
>>
>> Using 8 OpenMP threads
>> Using Petsc Development GIT revision: v3.14.3-583-g5464005aea  GIT Date: 2021-01-25 16:01:41 -0600
>>
>>                          Max       Max/Min     Avg       Total
>> Time (sec):           5.074e+03     1.000   5.074e+03
>> Objects:              2.158e+03     1.000   2.158e+03
>> Flop:                 5.236e+13     1.000   5.236e+13  5.236e+13
>> Flop/sec:             1.032e+10     1.000   1.032e+10  1.032e+10
>> MPI Messages:         0.000e+00     0.000   0.000e+00  0.000e+00
>> MPI Message Lengths:  0.000e+00     0.000   0.000e+00  0.000e+00
>> MPI Reductions:       0.000e+00     0.000
>>
>> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>>                             e.g., VecAXPY() for real vectors of length N --> 2N flop
>>                             and VecAXPY() for complex vectors of length N --> 8N flop
>>
>> Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
>>                         Avg     %Total     Avg     %Total    Count   %Total     Avg        %Total    Count   %Total
>>  0:      Main Stage: 5.0744e+03 100.0%  5.2359e+13 100.0%  0.000e+00   0.0%  0.000e+00       0.0%  0.000e+00   0.0%
>>
>> ------------------------------------------------------------------------------------------------------------------------
>> See the 'Profiling' chapter of the users' manual for details on interpreting output.
>> Phase summary info:
>>    Count: number of times phase was executed
>>    Time and Flop: Max - maximum over all processors
>>                   Ratio - ratio of maximum to minimum over all processors
>>    Mess: number of messages sent
>>    AvgLen: average message length (bytes)
>>    Reduct: number of global reductions
>>    Global: entire computation
>>    Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>>       %T - percent time in this phase         %F - percent flop in this phase
>>       %M - percent messages in this phase     %L - percent message lengths in this phase
>>       %R - percent reductions in this phase
>>    Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
>>    GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU time over all processors)
>>    CpuToGpu Count: total number of CPU to GPU copies per processor
>>    CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per processor)
>>    GpuToCpu Count: total number of GPU to CPU copies per processor
>>    GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per processor)
>>    GPU %F: percent flops on GPU in this event
>> ------------------------------------------------------------------------------------------------------------------------
>> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>
>> --- Event Stage 0: Main Stage
>>
>> VecSet                37 1.0 1.0354e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>> VecAssemblyBegin      31 1.0 2.9080e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>> VecAssemblyEnd        31 1.0 2.3270e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>> MatCopy            49928 1.0 3.7437e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  7  0  0  0  0   7  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>> MatConvert          2080 1.0 5.8492e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>> MatScale           56162 1.0 6.9348e+02 1.0 1.60e+12 1.0 0.0e+00 0.0e+00 0.0e+00 14  3  0  0  0  14  3  0  0  0  2303       0      0 0.00e+00    0 0.00e+00  0
>> MatAssemblyBegin   56222 1.0 1.7370e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>> MatAssemblyEnd     56222 1.0 8.8713e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>> MatZeroEntries     60363 1.0 3.1011e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  6  0  0  0  0   6  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>> MatAXPY             8320 1.0 1.2254e+02 1.0 5.58e+11 1.0 0.0e+00 0.0e+00 0.0e+00  2  1  0  0  0   2  1  0  0  0  4557       0      0 0.00e+00    0 0.00e+00  0
>> MatMatMultSym       4161 1.0 7.1613e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>> MatMatMultNum       4161 1.0 4.0706e+02 1.0 5.02e+13 1.0 0.0e+00 0.0e+00 0.0e+00  8 96  0  0  0   8 96  0  0  0 123331       0      0 0.00e+00    0 0.00e+00  0
>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>
>> Memory usage is given in bytes:
>>
>> Object Type          Creations   Destructions     Memory  Descendants' Mem.
>> Reports information only for process 0.
>>
>> --- Event Stage 0: Main Stage
>>
>>               Vector    37             34       1634064     0.
>>               Matrix  2120           2120   52734663456     0.
>>               Viewer     1              0             0     0.
>> ========================================================================================================================
>>
>> Apparently, MatMatMultNum and MatScale take the most time (by far) during execution. Therefore, I was wondering whether it is possible to move those operations/all matrices and vectors to a GPU or another accelerator. According to https://www.mcs.anl.gov/petsc/features/gpus.html, CUDA is only supported for distributed vectors, but not for dense distributed matrices. Are there any updates related to that, or other ways to speed up the involved operations?
>
> You should compute the timings associated with each call, and not consider the lump sum. For example, each MatScale takes 6.9348e+02/56162 = 0.012347851 seconds on average; I doubt you can get any reasonable speedup with CUDA. What are the sizes of these matrices?
>
>> Thanks!
>>
>> Regards,
>>
>> Roland
>
> --
> Stefano
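P.S.: In case it is useful for testing, below is a minimal standalone sketch (my own, purely illustrative) of how I would benchmark the two dominant operations so that the matrix type can be switched from the command line, e.g. running once as ./bench -log_view and once as ./bench -mat_type densecuda -log_view. This assumes a PETSc build configured with CUDA and a version that actually provides a CUDA dense matrix type, which is exactly the open question above; the sizes m, k, n are taken from the numbers I quoted, and everything else (names, the scaling factor) is arbitrary.

#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat            A, B, C;
  PetscInt       m = 8192, k = 8192, n = 32768; /* sizes taken from the discussion above */
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;

  /* Dense matrices; the type can still be overridden at runtime,
     e.g. with -mat_type densecuda (if the installed PETSc provides it) */
  ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
  ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, m, k);CHKERRQ(ierr);
  ierr = MatSetType(A, MATDENSE);CHKERRQ(ierr);
  ierr = MatSetFromOptions(A);CHKERRQ(ierr);
  ierr = MatSetUp(A);CHKERRQ(ierr);

  ierr = MatCreate(PETSC_COMM_WORLD, &B);CHKERRQ(ierr);
  ierr = MatSetSizes(B, PETSC_DECIDE, PETSC_DECIDE, k, n);CHKERRQ(ierr);
  ierr = MatSetType(B, MATDENSE);CHKERRQ(ierr);
  ierr = MatSetFromOptions(B);CHKERRQ(ierr);
  ierr = MatSetUp(B);CHKERRQ(ierr);

  /* Fill with random entries for the benchmark */
  ierr = MatSetRandom(A, NULL);CHKERRQ(ierr);
  ierr = MatSetRandom(B, NULL);CHKERRQ(ierr);
  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyBegin(B, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(B, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);

  /* The two dominant operations from the log: C = A*B, then a scaling */
  ierr = MatMatMult(A, B, MAT_INITIAL_MATRIX, PETSC_DEFAULT, &C);CHKERRQ(ierr);
  ierr = MatScale(C, 2.0);CHKERRQ(ierr);

  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = MatDestroy(&B);CHKERRQ(ierr);
  ierr = MatDestroy(&C);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}

Comparing the MatMatMultNum and MatScale lines of -log_view between the two runs (and the CpuToGpu/GpuToCpu columns) should show whether the copies eat whatever the GPU gains.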
