Hei,

the usual size of those matrices is (cumulative, not distributed) at least [8192 x 8192] x [8192 x 32768] complex entries. Does it still make sense to test CUDA for a speedup?
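As a rough back-of-the-envelope estimate on my side (using the 8N flop convention for complex operations from the quoted log, and assuming a single multiplication really involves matrices of that full size):

    one full-size complex product:   8 * 8192 * 8192 * 32768  ~ 1.8e+13 flop
    per logged MatMatMultNum call:   4.0706e+02 s / 4161  ~ 0.098 s,   5.02e+13 flop / 4161  ~ 1.2e+10 flop

so each individual product in the run below corresponds to roughly a tenth of a second of dense complex work on the CPU.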
Thank you,
regards,
Roland

On 16.02.21 at 14:14, Stefano Zampini wrote:
>
> On Tue, 16 Feb 2021 at 11:43, Roland Richter <[email protected]> wrote:
>
>> Hei,
>>
>> after profiling my program using -log_view, I got the following output (all matrices are dense):
>>
>> Using 8 OpenMP threads
>> Using Petsc Development GIT revision: v3.14.3-583-g5464005aea  GIT Date: 2021-01-25 16:01:41 -0600
>>
>>                          Max       Max/Min     Avg       Total
>> Time (sec):           5.074e+03     1.000   5.074e+03
>> Objects:              2.158e+03     1.000   2.158e+03
>> Flop:                 5.236e+13     1.000   5.236e+13  5.236e+13
>> Flop/sec:             1.032e+10     1.000   1.032e+10  1.032e+10
>> MPI Messages:         0.000e+00     0.000   0.000e+00  0.000e+00
>> MPI Message Lengths:  0.000e+00     0.000   0.000e+00  0.000e+00
>> MPI Reductions:       0.000e+00     0.000
>>
>> Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>>                             e.g., VecAXPY() for real vectors of length N --> 2N flop
>>                             and VecAXPY() for complex vectors of length N --> 8N flop
>>
>> Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
>>                         Avg     %Total     Avg     %Total    Count   %Total     Avg        %Total    Count   %Total
>>  0:      Main Stage: 5.0744e+03 100.0%  5.2359e+13 100.0%  0.000e+00   0.0%  0.000e+00       0.0%  0.000e+00   0.0%
>>
>> ------------------------------------------------------------------------------------------------------------------------
>> See the 'Profiling' chapter of the users' manual for details on interpreting output.
>> Phase summary info:
>>    Count: number of times phase was executed
>>    Time and Flop: Max - maximum over all processors
>>                   Ratio - ratio of maximum to minimum over all processors
>>    Mess: number of messages sent
>>    AvgLen: average message length (bytes)
>>    Reduct: number of global reductions
>>    Global: entire computation
>>    Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>>       %T - percent time in this phase         %F - percent flop in this phase
>>       %M - percent messages in this phase     %L - percent message lengths in this phase
>>       %R - percent reductions in this phase
>>    Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
>>    GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU time over all processors)
>>    CpuToGpu Count: total number of CPU to GPU copies per processor
>>    CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per processor)
>>    GpuToCpu Count: total number of GPU to CPU copies per processor
>>    GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per processor)
>>    GPU %F: percent flops on GPU in this event
>> ------------------------------------------------------------------------------------------------------------------------
>> Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
>>                    Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>
>> --- Event Stage 0: Main Stage
>>
>> VecSet                37 1.0 1.0354e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>> VecAssemblyBegin      31 1.0 2.9080e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>> VecAssemblyEnd        31 1.0 2.3270e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>> MatCopy            49928 1.0 3.7437e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  7  0  0  0  0   7  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>> MatConvert          2080 1.0 5.8492e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>> MatScale           56162 1.0 6.9348e+02 1.0 1.60e+12 1.0 0.0e+00 0.0e+00 0.0e+00 14  3  0  0  0  14  3  0  0  0  2303       0      0 0.00e+00    0 0.00e+00  0
>> MatAssemblyBegin   56222 1.0 1.7370e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>> MatAssemblyEnd     56222 1.0 8.8713e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>> MatZeroEntries     60363 1.0 3.1011e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  6  0  0  0  0   6  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>> MatAXPY             8320 1.0 1.2254e+02 1.0 5.58e+11 1.0 0.0e+00 0.0e+00 0.0e+00  2  1  0  0  0   2  1  0  0  0  4557       0      0 0.00e+00    0 0.00e+00  0
>> MatMatMultSym       4161 1.0 7.1613e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>> MatMatMultNum       4161 1.0 4.0706e+02 1.0 5.02e+13 1.0 0.0e+00 0.0e+00 0.0e+00  8 96  0  0  0   8 96  0  0  0 123331       0      0 0.00e+00    0 0.00e+00  0
>> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>
>> Memory usage is given in bytes:
>>
>> Object Type          Creations   Destructions     Memory  Descendants' Mem.
>> Reports information only for process 0.
>>
>> --- Event Stage 0: Main Stage
>>
>>               Vector    37             34       1634064     0.
>>               Matrix  2120           2120   52734663456     0.
>>               Viewer     1              0             0     0.
>> ========================================================================================================================
>>
>> Apparently, MatMatMultNum and MatScale take the most time (by far) during execution. Therefore, I was wondering whether it is possible to move those operations/all matrices and vectors to a GPU or another accelerator. According to https://www.mcs.anl.gov/petsc/features/gpus.html, CUDA is only supported for distributed vectors, but not for dense distributed matrices. Are there any updates related to that, or other ways to speed up the involved operations?
>
> You should compute the timings associated with each call, and not consider the lump sum. For example, each MatScale takes 6.9348e+02/56162 = 0.012347851 seconds on average; I doubt you can get any reasonable speedup with CUDA. What are the sizes of these matrices?
>
>> Thanks!
>>
>> Regards,
>>
>> Roland
>
> --
> Stefano
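P.S.: In case it is useful for testing, below is a minimal standalone sketch (my own, purely illustrative) of how I would benchmark the two dominant operations so that the matrix type can be switched from the command line, e.g. running once as ./bench -log_view and once as ./bench -mat_type densecuda -log_view. This assumes a PETSc build configured with CUDA and a version that actually provides a CUDA dense matrix type, which is exactly the open question above; the sizes m, k, n are taken from the numbers I quoted, and everything else (names, the scaling factor) is arbitrary.

#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat            A, B, C;
  PetscInt       m = 8192, k = 8192, n = 32768; /* sizes taken from the discussion above */
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;

  /* Dense matrices; the type can still be overridden at runtime,
     e.g. with -mat_type densecuda (if the installed PETSc provides it) */
  ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
  ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, m, k);CHKERRQ(ierr);
  ierr = MatSetType(A, MATDENSE);CHKERRQ(ierr);
  ierr = MatSetFromOptions(A);CHKERRQ(ierr);
  ierr = MatSetUp(A);CHKERRQ(ierr);

  ierr = MatCreate(PETSC_COMM_WORLD, &B);CHKERRQ(ierr);
  ierr = MatSetSizes(B, PETSC_DECIDE, PETSC_DECIDE, k, n);CHKERRQ(ierr);
  ierr = MatSetType(B, MATDENSE);CHKERRQ(ierr);
  ierr = MatSetFromOptions(B);CHKERRQ(ierr);
  ierr = MatSetUp(B);CHKERRQ(ierr);

  /* Fill with random entries for the benchmark */
  ierr = MatSetRandom(A, NULL);CHKERRQ(ierr);
  ierr = MatSetRandom(B, NULL);CHKERRQ(ierr);
  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyBegin(B, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(B, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);

  /* The two dominant operations from the log: C = A*B, then a scaling */
  ierr = MatMatMult(A, B, MAT_INITIAL_MATRIX, PETSC_DEFAULT, &C);CHKERRQ(ierr);
  ierr = MatScale(C, 2.0);CHKERRQ(ierr);

  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = MatDestroy(&B);CHKERRQ(ierr);
  ierr = MatDestroy(&C);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}

Comparing the MatMatMultNum and MatScale lines of -log_view between the two runs (and the CpuToGpu/GpuToCpu columns) should show whether the copies eat whatever the GPU gains.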
