Yes, I call MatAXPY, but the matrix size stays the same.

Regards,

Roland
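For reference, a minimal sketch of the two in-place updates discussed below, assuming dense matrices of identical size (the helper name is hypothetical, not code from this thread). The note that a generic MatAYPX may internally scale Y with MatScale before adding X is an assumption worth checking against the PETSc sources, but it would explain MatScale showing up in -log_view without ever being called explicitly:

#include <petscmat.h>

/* Hypothetical helper: X and Y are dense matrices of the same size,
   so neither update changes the matrix dimensions. */
PetscErrorCode UpdateMatrices(Mat Y, Mat X, PetscScalar a)
{
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  /* Y <- a*X + Y */
  ierr = MatAXPY(Y, a, X, SAME_NONZERO_PATTERN); CHKERRQ(ierr);
  /* Y <- a*Y + X; a generic implementation may scale Y first (MatScale)
     and then add X (MatAXPY), which would make MatScale appear in
     -log_view even though it is never called directly (assumption). */
  ierr = MatAYPX(Y, a, X, SAME_NONZERO_PATTERN); CHKERRQ(ierr);
  PetscFunctionReturn(0);
}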
On 16.02.21 at 14:46, Stefano Zampini wrote:
>
> On Tue, 16 Feb 2021 at 16:30, Roland Richter <[email protected]> wrote:
>
>     For MatMatMult the size of the involved matrices is 8k x 8k and 8k x 32k.
>
> Ok, so you have 32k columns to multiply against. Maybe you can get some speedup.
> However, if you keep updating the matrix entries on the CPU, then using CUDA will make little sense.
> In any case, you can try and see if you get any speedup.
>
>     I am not sure where MatScale is called, I never call it explicitly. If MatDiagonalScale calls MatScale, then the involved matrices have a size of 8k x 32k.
>
> No, it does not. Are you calling MatAYPX?
>
>     Regards,
>
>     Roland
>
>     On 16.02.21 at 14:25, Stefano Zampini wrote:
>>
>>     the usual size of those matrices is (cumulative, not distributed) at least [8192x8192] x [8192x32768] complex entries as a lower boundary. Does it still make sense to test CUDA for a speedup?
>>
>> I don't understand your notation. Are you saying your matrices are 8K x 8K? Or 8K*32K? Or what?
>>
>>     Thank you,
>>
>>     regards,
>>
>>     Roland
>>
>>     On 16.02.21 at 14:14, Stefano Zampini wrote:
>>>
>>> On Tue, 16 Feb 2021 at 11:43, Roland Richter <[email protected]> wrote:
>>>
>>>     Hi,
>>>
>>>     after profiling my program using -log_view, I got the following output (all matrices are dense):
>>>
>>>     Using 8 OpenMP threads
>>>     Using Petsc Development GIT revision: v3.14.3-583-g5464005aea  GIT Date: 2021-01-25 16:01:41 -0600
>>>
>>>                                   Max       Max/Min     Avg       Total
>>>     Time (sec):           5.074e+03     1.000   5.074e+03
>>>     Objects:              2.158e+03     1.000   2.158e+03
>>>     Flop:                 5.236e+13     1.000   5.236e+13  5.236e+13
>>>     Flop/sec:             1.032e+10     1.000   1.032e+10  1.032e+10
>>>     MPI Messages:         0.000e+00     0.000   0.000e+00  0.000e+00
>>>     MPI Message Lengths:  0.000e+00     0.000   0.000e+00  0.000e+00
>>>     MPI Reductions:       0.000e+00     0.000
>>>
>>>     Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>>>                               e.g., VecAXPY() for real vectors of length N --> 2N flop
>>>                               and VecAXPY() for complex vectors of length N --> 8N flop
>>>
>>>     Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
>>>                             Avg     %Total     Avg     %Total    Count   %Total     Avg         %Total    Count   %Total
>>>      0:      Main Stage: 5.0744e+03 100.0%  5.2359e+13 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%
>>>
>>>     ------------------------------------------------------------------------------------------------------------------------
>>>     See the 'Profiling' chapter of the users' manual for details on interpreting output.
>>>     Phase summary info:
>>>        Count: number of times phase was executed
>>>        Time and Flop: Max - maximum over all processors
>>>                       Ratio - ratio of maximum to minimum over all processors
>>>        Mess: number of messages sent
>>>        AvgLen: average message length (bytes)
>>>        Reduct: number of global reductions
>>>        Global: entire computation
>>>        Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>>>           %T - percent time in this phase         %F - percent flop in this phase
>>>           %M - percent messages in this phase     %L - percent message lengths in this phase
>>>           %R - percent reductions in this phase
>>>        Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
>>>        GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU time over all processors)
>>>        CpuToGpu Count: total number of CPU to GPU copies per processor
>>>        CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per processor)
>>>        GpuToCpu Count: total number of GPU to CPU copies per processor
>>>        GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per processor)
>>>        GPU %F: percent flops on GPU in this event
>>>     ------------------------------------------------------------------------------------------------------------------------
>>>     Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total    GPU    - CpuToGpu -   - GpuToCpu - GPU
>>>                        Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
>>>     ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>
>>>     --- Event Stage 0: Main Stage
>>>
>>>     VecSet                37 1.0 1.0354e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>     VecAssemblyBegin      31 1.0 2.9080e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>     VecAssemblyEnd        31 1.0 2.3270e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>     MatCopy            49928 1.0 3.7437e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  7  0  0  0  0   7  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>     MatConvert          2080 1.0 5.8492e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>     MatScale           56162 1.0 6.9348e+02 1.0 1.60e+12 1.0 0.0e+00 0.0e+00 0.0e+00 14  3  0  0  0  14  3  0  0  0  2303       0      0 0.00e+00    0 0.00e+00  0
>>>     MatAssemblyBegin   56222 1.0 1.7370e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>     MatAssemblyEnd     56222 1.0 8.8713e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>     MatZeroEntries     60363 1.0 3.1011e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  6  0  0  0  0   6  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>     MatAXPY             8320 1.0 1.2254e+02 1.0 5.58e+11 1.0 0.0e+00 0.0e+00 0.0e+00  2  1  0  0  0   2  1  0  0  0  4557       0      0 0.00e+00    0 0.00e+00  0
>>>     MatMatMultSym       4161 1.0 7.1613e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>     MatMatMultNum       4161 1.0 4.0706e+02 1.0 5.02e+13 1.0 0.0e+00 0.0e+00 0.0e+00  8 96  0  0  0   8 96  0  0  0 123331       0      0 0.00e+00    0 0.00e+00  0
>>>     ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>
>>>     Memory usage is given in bytes:
>>>
>>>     Object Type          Creations   Destructions     Memory  Descendants' Mem.
>>>     Reports information only for process 0.
>>>
>>>     --- Event Stage 0: Main Stage
>>>
>>>                   Vector    37             34         1634064     0.
>>>                   Matrix  2120           2120     52734663456     0.
>>>                   Viewer     1              0               0     0.
>>>     ========================================================================================================================
>>>
>>>     Apparently, MatMatMultNum and MatScale take the most time (by far) during execution. Therefore, I was wondering whether it is possible to move those operations/all matrices and vectors to a GPU or another accelerator. According to https://www.mcs.anl.gov/petsc/features/gpus.html, CUDA is only supported for distributed vectors, but not for dense distributed matrices. Are there any updates related to that, or other ways to speed up the involved operations?
>>>
>>> You should compute the timings associated with each call, and not consider the lump sum. For example, each MatScale takes 6.9348e+02/56162 = 0.012347851 seconds on average; I doubt you can get any reasonable speedup with CUDA. What are the sizes of these matrices?
>>>
>>>     Thanks!
>>>
>>>     Regards,
>>>
>>>     Roland
>>>
>>> --
>>> Stefano
>>
>> --
>> Stefano
>
> --
> Stefano
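Following the suggestion above, the per-call averages derived from the -log_view numbers in the quoted output (total time divided by call count) are roughly:

MatMatMultNum:  4.0706e+02 s / 4161 calls  ~ 0.098  s per call
MatScale:       6.9348e+02 s / 56162 calls ~ 0.0123 s per call
MatAXPY:        1.2254e+02 s / 8320 calls  ~ 0.0147 s per call
MatCopy:        3.7437e+02 s / 49928 calls ~ 0.0075 s per call
MatZeroEntries: 3.1011e+02 s / 60363 calls ~ 0.0051 s per call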

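To try Stefano's suggestion of testing for a GPU speedup, a minimal sketch along the following lines could be timed. It assumes a CUDA-enabled PETSc build in which the dense CUDA matrix type (MATDENSECUDA) supports MatMatMult and MatScale; the sizes follow the 8192 x 8192 and 8192 x 32768 matrices mentioned above, and the matrices are left at their zero values since only the kernels are of interest here:

#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat            A, B, C;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;

  /* A: 8192 x 8192, B: 8192 x 32768, dense by default; the type can be
     overridden at run time, e.g. with -mat_type densecuda (assumes a
     CUDA-enabled build providing MATDENSECUDA) */
  ierr = MatCreate(PETSC_COMM_WORLD, &A); CHKERRQ(ierr);
  ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, 8192, 8192); CHKERRQ(ierr);
  ierr = MatSetType(A, MATDENSE); CHKERRQ(ierr);
  ierr = MatSetFromOptions(A); CHKERRQ(ierr);
  ierr = MatSetUp(A); CHKERRQ(ierr);
  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);

  ierr = MatCreate(PETSC_COMM_WORLD, &B); CHKERRQ(ierr);
  ierr = MatSetSizes(B, PETSC_DECIDE, PETSC_DECIDE, 8192, 32768); CHKERRQ(ierr);
  ierr = MatSetType(B, MATDENSE); CHKERRQ(ierr);
  ierr = MatSetFromOptions(B); CHKERRQ(ierr);
  ierr = MatSetUp(B); CHKERRQ(ierr);
  ierr = MatAssemblyBegin(B, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
  ierr = MatAssemblyEnd(B, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);

  /* C = A*B followed by a scaling, the two dominant operations in the log */
  ierr = MatMatMult(A, B, MAT_INITIAL_MATRIX, PETSC_DEFAULT, &C); CHKERRQ(ierr);
  ierr = MatScale(C, 2.0); CHKERRQ(ierr);

  ierr = MatDestroy(&C); CHKERRQ(ierr);
  ierr = MatDestroy(&B); CHKERRQ(ierr);
  ierr = MatDestroy(&A); CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}

Running this once with -log_view and once with -mat_type densecuda -log_view should show whether MatMatMultNum and MatScale benefit; as noted above, if the entries keep being updated on the CPU between products, the CPU-to-GPU copies may cancel any gain.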