Hi, after profiling my program with -log_view, I got the following output (all matrices are dense):

Using 8 OpenMP threads
Using Petsc Development GIT revision: v3.14.3-583-g5464005aea  GIT Date: 2021-01-25 16:01:41 -0600

                         Max       Max/Min     Avg       Total
Time (sec):           5.074e+03     1.000   5.074e+03
Objects:              2.158e+03     1.000   2.158e+03
Flop:                 5.236e+13     1.000   5.236e+13  5.236e+13
Flop/sec:             1.032e+10     1.000   1.032e+10  1.032e+10
MPI Messages:         0.000e+00     0.000   0.000e+00  0.000e+00
MPI Message Lengths:  0.000e+00     0.000   0.000e+00  0.000e+00
MPI Reductions:       0.000e+00     0.000

Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
                            e.g., VecAXPY() for real vectors of length N --> 2N flop
                            and VecAXPY() for complex vectors of length N --> 8N flop

Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
                        Avg     %Total     Avg     %Total    Count   %Total     Avg         %Total    Count   %Total
 0:      Main Stage: 5.0744e+03 100.0%  5.2359e+13 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%

------------------------------------------------------------------------------------------------------------------------
See the 'Profiling' chapter of the users' manual for details on interpreting output.
Phase summary info:
   Count: number of times phase was executed
   Time and Flop: Max - maximum over all processors
                  Ratio - ratio of maximum to minimum over all processors
   Mess: number of messages sent
   AvgLen: average message length (bytes)
   Reduct: number of global reductions
   Global: entire computation
   Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
      %T - percent time in this phase         %F - percent flop in this phase
      %M - percent messages in this phase     %L - percent message lengths in this phase
      %R - percent reductions in this phase
   Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
   GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU time over all processors)
   CpuToGpu Count: total number of CPU to GPU copies per processor
   CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per processor)
   GpuToCpu Count: total number of GPU to CPU copies per processor
   GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per processor)
   GPU %F: percent flops on GPU in this event
------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
---------------------------------------------------------------------------------------------------------------------------------------------------------------

--- Event Stage 0: Main Stage

VecSet                37 1.0 1.0354e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
VecAssemblyBegin      31 1.0 2.9080e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
VecAssemblyEnd        31 1.0 2.3270e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
MatCopy            49928 1.0 3.7437e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  7  0  0  0  0   7  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
MatConvert          2080 1.0 5.8492e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
MatScale           56162 1.0 6.9348e+02 1.0 1.60e+12 1.0 0.0e+00 0.0e+00 0.0e+00 14  3  0  0  0  14  3  0  0  0   2303       0      0 0.00e+00    0 0.00e+00  0
MatAssemblyBegin   56222 1.0 1.7370e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
MatAssemblyEnd     56222 1.0 8.8713e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
MatZeroEntries     60363 1.0 3.1011e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  6  0  0  0  0   6  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
MatAXPY             8320 1.0 1.2254e+02 1.0 5.58e+11 1.0 0.0e+00 0.0e+00 0.0e+00  2  1  0  0  0   2  1  0  0  0   4557       0      0 0.00e+00    0 0.00e+00  0
MatMatMultSym       4161 1.0 7.1613e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
MatMatMultNum       4161 1.0 4.0706e+02 1.0 5.02e+13 1.0 0.0e+00 0.0e+00 0.0e+00  8 96  0  0  0   8 96  0  0  0 123331       0      0 0.00e+00    0 0.00e+00  0
---------------------------------------------------------------------------------------------------------------------------------------------------------------

Memory usage is given in bytes:

Object Type          Creations   Destructions     Memory  Descendants' Mem.
Reports information only for process 0.

--- Event Stage 0: Main Stage

              Vector    37             34      1634064     0.
              Matrix  2120           2120  52734663456     0.
              Viewer     1              0            0     0.
========================================================================================================================

Apparently, MatMatMultNum and MatScale take the most time (by far) during execution. Therefore, I was wondering if it is possible to move those operations/all matrices and vectors to a GPU or another accelerator.

According to https://www.mcs.anl.gov/petsc/features/gpus.html CUDA is only supported for distributed vectors, but not for dense distributed matrices. Are there any updates related to that, or other ways to speed up the involved operations?

Thanks!

Regards,
Roland
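
P.S. For reference, here is a small Python sketch (not part of my program) that tallies the matrix events from the -log_view table above. The times and flop counts are copied from the "Time (sec) Max" and "Flop Max" columns; the names `events` and `summarize` are only for illustration. It assumes the Mflop/s column is 1e-6 * flop / time, which reproduces the logged values to within rounding.

```python
# Wall time from the "Time (sec)" line of the -log_view summary.
TOTAL_TIME = 5.074e3

# (time in seconds, flop count) per event, copied from the event table.
events = {
    "MatScale":       (6.9348e2, 1.60e12),
    "MatMatMultNum":  (4.0706e2, 5.02e13),
    "MatCopy":        (3.7437e2, 0.0),
    "MatZeroEntries": (3.1011e2, 0.0),
    "MatAXPY":        (1.2254e2, 5.58e11),
}


def summarize(events, total_time):
    """Return (name, seconds, percent of wall time, Mflop/s), slowest first."""
    rows = []
    for name, (seconds, flop) in events.items():
        share = 100.0 * seconds / total_time
        mflops = 1e-6 * flop / seconds  # assumed definition of the Mflop/s column
        rows.append((name, seconds, share, mflops))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


rows = summarize(events, TOTAL_TIME)
logged = sum(row[1] for row in rows)

for name, seconds, share, mflops in rows:
    print(f"{name:15s} {seconds:9.2f} s  {share:5.1f} %  {mflops:10.0f} Mflop/s")
print(f"sum of listed events: {logged:.2f} s "
      f"({100.0 * logged / TOTAL_TIME:.1f} % of wall time)")
```

The last line compares the summed event time with the total run time, which shows how much of the run is covered by these logged matrix events.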
