> the usual size of those matrices is (cumulative, not distributed) at least
> [8192x8192] x [8192x32768] complex entries as a lower bound. Does it still
> make sense to test CUDA for speedup?

I don't understand your notation. Are you saying your matrices are 8K x 8K, or 8K x 32K, or what?

> Thank you,
>
> Regards,
>
> Roland
>
> On 16.02.21 at 14:14, Stefano Zampini wrote:
>
>> On Tue, 16 Feb 2021 at 11:43, Roland Richter <[email protected]> wrote:
>>
>>> Hei,
>>>
>>> after profiling my program using -log_view, I got the following output (all matrices are dense):
>>>
>>> Using 8 OpenMP threads
>>> Using Petsc Development GIT revision: v3.14.3-583-g5464005aea  GIT Date: 2021-01-25 16:01:41 -0600
>>>
>>>                          Max       Max/Min     Avg        Total
>>> Time (sec):           5.074e+03     1.000   5.074e+03
>>> Objects:              2.158e+03     1.000   2.158e+03
>>> Flop:                 5.236e+13     1.000   5.236e+13  5.236e+13
>>> Flop/sec:             1.032e+10     1.000   1.032e+10  1.032e+10
>>> MPI Messages:         0.000e+00     0.000   0.000e+00  0.000e+00
>>> MPI Message Lengths:  0.000e+00     0.000   0.000e+00  0.000e+00
>>> MPI Reductions:       0.000e+00     0.000
>>>
>>> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>>>                             e.g., VecAXPY() for real vectors of length N --> 2N flop
>>>                             and VecAXPY() for complex vectors of length N --> 8N flop
>>>
>>> Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
>>>                         Avg     %Total     Avg     %Total    Count   %Total     Avg        %Total    Count   %Total
>>>  0:      Main Stage: 5.0744e+03 100.0%  5.2359e+13 100.0%  0.000e+00   0.0%  0.000e+00       0.0%  0.000e+00   0.0%
>>>
>>> ------------------------------------------------------------------------------------------------------------------------
>>> See the 'Profiling' chapter of the users' manual for details on interpreting output.
>>> Phase summary info:
>>>    Count: number of times phase was executed
>>>    Time and Flop: Max - maximum over all processors
>>>                   Ratio - ratio of maximum to minimum over all processors
>>>    Mess: number of messages sent
>>>    AvgLen: average message length (bytes)
>>>    Reduct: number of global reductions
>>>    Global: entire computation
>>>    Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>>>       %T - percent time in this phase         %F - percent flop in this phase
>>>       %M - percent messages in this phase     %L - percent message lengths in this phase
>>>       %R - percent reductions in this phase
>>>    Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
>>>    GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU time over all processors)
>>>    CpuToGpu Count: total number of CPU to GPU copies per processor
>>>    CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per processor)
>>>    GpuToCpu Count: total number of GPU to CPU copies per processor
>>>    GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per processor)
>>>    GPU %F: percent flops on GPU in this event
>>> ------------------------------------------------------------------------------------------------------------------------
>>> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----   Total    GPU    - CpuToGpu -   - GpuToCpu - GPU
>>>                        Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>
>>> --- Event Stage 0: Main Stage
>>>
>>> VecSet                37 1.0 1.0354e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
>>> VecAssemblyBegin      31 1.0 2.9080e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
>>> VecAssemblyEnd        31 1.0 2.3270e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
>>> MatCopy            49928 1.0 3.7437e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  7  0  0  0  0   7  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
>>> MatConvert          2080 1.0 5.8492e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
>>> MatScale           56162 1.0 6.9348e+02 1.0 1.60e+12 1.0 0.0e+00 0.0e+00 0.0e+00 14  3  0  0  0  14  3  0  0  0   2303       0      0 0.00e+00    0 0.00e+00  0
>>> MatAssemblyBegin   56222 1.0 1.7370e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
>>> MatAssemblyEnd     56222 1.0 8.8713e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
>>> MatZeroEntries     60363 1.0 3.1011e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  6  0  0  0  0   6  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
>>> MatAXPY             8320 1.0 1.2254e+02 1.0 5.58e+11 1.0 0.0e+00 0.0e+00 0.0e+00  2  1  0  0  0   2  1  0  0  0   4557       0      0 0.00e+00    0 0.00e+00  0
>>> MatMatMultSym       4161 1.0 7.1613e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
>>> MatMatMultNum       4161 1.0 4.0706e+02 1.0 5.02e+13 1.0 0.0e+00 0.0e+00 0.0e+00  8 96  0  0  0   8 96  0  0  0 123331       0      0 0.00e+00    0 0.00e+00  0
>>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>
>>> Memory usage is given in bytes:
>>>
>>> Object Type          Creations   Destructions     Memory  Descendants' Mem.
>>> Reports information only for process 0.
>>>
>>> --- Event Stage 0: Main Stage
>>>
>>>               Vector        37             34       1634064     0.
>>>               Matrix      2120           2120   52734663456     0.
>>>               Viewer         1              0             0     0.
>>> ========================================================================================================================
>>>
>>> Apparently, MatMatMultNum and MatScale take the most time (by far) during execution. Therefore, I was wondering whether it is possible to move those operations/all matrices and vectors to a GPU or another accelerator. According to https://www.mcs.anl.gov/petsc/features/gpus.html, CUDA is only supported for distributed vectors, but not for dense distributed matrices. Are there any updates related to that, or other ways to speed up the involved operations?
>>
>> You should compute the timings associated with each call, and not consider the lump sum. For example, each MatScale takes 6.9348e+02/56162 = 0.012347851 seconds on average; I doubt you can get any reasonable speedup with CUDA. What are the sizes of these matrices?
>>
>>> Thanks!
>>>
>>> Regards,
>>>
>>> Roland
>>
>> --
>> Stefano
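
[Editorial note] For scale, and only under the assumption that the notation means a single product of an 8192x8192 complex matrix with an 8192x32768 one (which is exactly what the reply above asks to clarify), PETSc's flop convention (8 flop per complex multiply-add, as stated in the log header) and 16 bytes per complex double entry give roughly:

\[
\mathrm{flop} \approx 8 \cdot 8192 \cdot 8192 \cdot 32768 \approx 1.8 \times 10^{13},
\qquad
\mathrm{memory} \approx \left(8192 \cdot 8192 + 2 \cdot 8192 \cdot 32768\right) \cdot 16~\mathrm{B} \approx 9~\mathrm{GiB}.
\]

For comparison, the log above reports 5.02e13 flop spread over 4161 MatMatMultNum calls, i.e. about 1.2e10 flop per call, so the actual per-call sizes matter a great deal when judging whether a GPU could amortize its transfer costs.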
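[Editorial note] Two points in this exchange lend themselves to a small experiment: Stefano's advice to look at per-call timings (which the PetscLogStagePush()/PetscLogStagePop() hint in the log header addresses directly) and Roland's question about running the dense MatMatMult/MatScale work on a GPU. The sketch below is only illustrative: the sizes, stage name, and iteration count are placeholders, and it assumes a PETSc build where the dense matrix type can be switched at runtime (e.g. -mat_type densecuda on a CUDA-enabled development build); whether those operations then actually execute on the device is something the GPU columns of -log_view would have to confirm.

#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat            A, B, C;
  PetscLogStage  hot;                           /* custom profiling stage; the name is arbitrary */
  PetscInt       i, m = 512, k = 512, n = 2048; /* placeholder sizes, not the real 8192/32768 case */
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;

  /* Dense matrices whose concrete type can be overridden from the command line,
     e.g. -mat_type densecuda on a CUDA-enabled build */
  ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
  ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, m, k);CHKERRQ(ierr);
  ierr = MatSetType(A, MATDENSE);CHKERRQ(ierr);
  ierr = MatSetFromOptions(A);CHKERRQ(ierr);
  ierr = MatSetUp(A);CHKERRQ(ierr);
  ierr = MatSetRandom(A, NULL);CHKERRQ(ierr);

  ierr = MatCreate(PETSC_COMM_WORLD, &B);CHKERRQ(ierr);
  ierr = MatSetSizes(B, PETSC_DECIDE, PETSC_DECIDE, k, n);CHKERRQ(ierr);
  ierr = MatSetType(B, MATDENSE);CHKERRQ(ierr);
  ierr = MatSetFromOptions(B);CHKERRQ(ierr);
  ierr = MatSetUp(B);CHKERRQ(ierr);
  ierr = MatSetRandom(B, NULL);CHKERRQ(ierr);

  /* Isolate the two dominant operations from the profile above in their own -log_view stage */
  ierr = PetscLogStageRegister("HotLoop", &hot);CHKERRQ(ierr);
  ierr = PetscLogStagePush(hot);CHKERRQ(ierr);
  ierr = MatMatMult(A, B, MAT_INITIAL_MATRIX, PETSC_DEFAULT, &C);CHKERRQ(ierr); /* reported under MatMatMultSym/Num */
  for (i = 0; i < 10; ++i) {
    ierr = MatScale(C, 0.5);CHKERRQ(ierr);                                      /* reported under MatScale */
  }
  ierr = PetscLogStagePop();CHKERRQ(ierr);

  ierr = MatDestroy(&C);CHKERRQ(ierr);
  ierr = MatDestroy(&B);CHKERRQ(ierr);
  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}

Running this once with the default dense type and once with a CUDA dense type, both with -log_view, would give per-stage timings for exactly these calls, so the average cost per MatScale/MatMatMult and any CpuToGpu/GpuToCpu traffic can be compared directly instead of dividing lump sums by counts.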
