For MatMatMult the involved matrices are 8k x 8k and 8k x 32k. I am not sure where MatScale is called; I never call it explicitly. If MatDiagonalScale internally calls MatScale, then the matrices involved there are 8k x 32k.
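In case it helps to narrow down where MatScale comes from, here is a minimal, self-contained sketch (the stage name, the tiny 8x8 matrix, and the vector values are placeholders, not taken from the actual application; error checking is omitted for brevity). It wraps a MatDiagonalScale call in its own logging stage, so that running with -log_view and checking whether MatScale is reported under that stage would show whether MatDiagonalScale is the caller:

    #include <petscmat.h>

    int main(int argc, char **argv)
    {
      Mat           A;
      Vec           l, r;
      PetscLogStage stage;
      PetscInt      m = 8, n = 8;   /* tiny stand-in for the real 8k x 32k matrices */

      PetscInitialize(&argc, &argv, NULL, NULL);

      /* Dense matrix plus compatible left/right scaling vectors */
      MatCreateDense(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, m, n, NULL, &A);
      MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
      MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
      MatCreateVecs(A, &r, &l);   /* r: column-compatible, l: row-compatible */
      VecSet(l, 2.0);
      VecSet(r, 0.5);

      /* Put the suspected call into its own logging stage; with -log_view,
         any MatScale triggered here is listed under "SuspectedScaling". */
      PetscLogStageRegister("SuspectedScaling", &stage);
      PetscLogStagePush(stage);
      MatDiagonalScale(A, l, r);
      PetscLogStagePop();

      MatDestroy(&A);
      VecDestroy(&l);
      VecDestroy(&r);
      PetscFinalize();
      return 0;
    }

If MatScale then shows up under that stage with the expected count, MatDiagonalScale (or something it calls) is responsible; otherwise the calls must originate elsewhere.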
Regards,
Roland

On 16.02.21 at 14:25, Stefano Zampini wrote:
>
>     the usual size of those matrices is (cumulative, not distributed)
>     at least [8192x8192] x [8192x32768] complex entries as lower
>     boundary. Does it still make sense to test CUDA for speedup?
>
> I don't understand your notation. Are you saying your matrices are 8K
> x 8K? or 8K*32K? or what?
>
>     Thank you,
>
>     regards,
>
>     Roland
>
>     On 16.02.21 at 14:14, Stefano Zampini wrote:
>>
>> On Tue, Feb 16, 2021 at 11:43 Roland Richter
>> <[email protected]> wrote:
>>
>>     Hi,
>>
>>     after profiling my program using -log_view, I got the
>>     following output (all matrices are dense):
>>
>>     Using 8 OpenMP threads
>>     Using Petsc Development GIT revision: v3.14.3-583-g5464005aea  GIT Date: 2021-01-25 16:01:41 -0600
>>
>>                              Max       Max/Min     Avg       Total
>>     Time (sec):           5.074e+03     1.000   5.074e+03
>>     Objects:              2.158e+03     1.000   2.158e+03
>>     Flop:                 5.236e+13     1.000   5.236e+13  5.236e+13
>>     Flop/sec:             1.032e+10     1.000   1.032e+10  1.032e+10
>>     MPI Messages:         0.000e+00     0.000   0.000e+00  0.000e+00
>>     MPI Message Lengths:  0.000e+00     0.000   0.000e+00  0.000e+00
>>     MPI Reductions:       0.000e+00     0.000
>>
>>     Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>>                                 e.g., VecAXPY() for real vectors of length N --> 2N flop
>>                                 and VecAXPY() for complex vectors of length N --> 8N flop
>>
>>     Summary of Stages:   ----- Time ------  ----- Flop ------   --- Messages ---  -- Message Lengths --  -- Reductions --
>>                             Avg     %Total     Avg     %Total    Count   %Total     Avg         %Total    Count   %Total
>>      0:      Main Stage: 5.0744e+03 100.0%  5.2359e+13 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%
>>
>>     ------------------------------------------------------------------------------------------------------------------------
>>     See the 'Profiling' chapter of the users' manual for details on interpreting output.
>>     Phase summary info:
>>        Count: number of times phase was executed
>>        Time and Flop: Max - maximum over all processors
>>                       Ratio - ratio of maximum to minimum over all processors
>>        Mess: number of messages sent
>>        AvgLen: average message length (bytes)
>>        Reduct: number of global reductions
>>        Global: entire computation
>>        Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>>           %T - percent time in this phase         %F - percent flop in this phase
>>           %M - percent messages in this phase     %L - percent message lengths in this phase
>>           %R - percent reductions in this phase
>>        Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
>>        GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU time over all processors)
>>        CpuToGpu Count: total number of CPU to GPU copies per processor
>>        CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per processor)
>>        GpuToCpu Count: total number of GPU to CPU copies per processor
>>        GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per processor)
>>        GPU %F: percent flops on GPU in this event
>>     ------------------------------------------------------------------------------------------------------------------------
>>     Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
>>                        Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
>>     ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>
>>     --- Event Stage 0: Main Stage
>>
>>     VecSet                37 1.0 1.0354e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>     VecAssemblyBegin      31 1.0 2.9080e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>     VecAssemblyEnd        31 1.0 2.3270e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>     MatCopy            49928 1.0 3.7437e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  7  0  0  0  0   7  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>     MatConvert          2080 1.0 5.8492e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>     MatScale           56162 1.0 6.9348e+02 1.0 1.60e+12 1.0 0.0e+00 0.0e+00 0.0e+00 14  3  0  0  0  14  3  0  0  0  2303       0      0 0.00e+00    0 0.00e+00  0
>>     MatAssemblyBegin   56222 1.0 1.7370e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>     MatAssemblyEnd     56222 1.0 8.8713e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>     MatZeroEntries     60363 1.0 3.1011e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  6  0  0  0  0   6  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>     MatAXPY             8320 1.0 1.2254e+02 1.0 5.58e+11 1.0 0.0e+00 0.0e+00 0.0e+00  2  1  0  0  0   2  1  0  0  0  4557       0      0 0.00e+00    0 0.00e+00  0
>>     MatMatMultSym       4161 1.0 7.1613e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>     MatMatMultNum       4161 1.0 4.0706e+02 1.0 5.02e+13 1.0 0.0e+00 0.0e+00 0.0e+00  8 96  0  0  0   8 96  0  0  0 123331       0      0 0.00e+00    0 0.00e+00  0
>>     ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>
>>     Memory usage is given in bytes:
>>
>>     Object Type          Creations   Destructions     Memory  Descendants' Mem.
>>     Reports information only for process 0.
>>
>>     --- Event Stage 0: Main Stage
>>
>>                   Vector    37             34      1634064     0.
>>                   Matrix  2120           2120  52734663456     0.
>>                   Viewer     1              0            0     0.
>>     ========================================================================================================================
>>
>>     Apparently, MatMatMultNum and MatScale take the most time (by
>>     far) during execution. Therefore, I was wondering if it is
>>     possible to move those operations/all matrices and vectors to
>>     a GPU or another accelerator. According to
>>     https://www.mcs.anl.gov/petsc/features/gpus.html CUDA is
>>     only supported for distributed vectors, but not for dense
>>     distributed matrices. Are there any updates related to that,
>>     or other ways to speed up the involved operations?
>>
>> You should compute the timings associated with each call, and not
>> consider the lump sum. For example, each MatScale takes
>> 6.9348e+02/56162 = 0.012347851 seconds on average; I doubt you
>> can get any reasonable speedup with CUDA. What are the sizes of
>> these matrices?
>>
>>     Thanks!
>>
>>     Regards,
>>
>>     Roland
>>
>> --
>> Stefano
>
> --
> Stefano
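As an aside on the per-call estimate suggested above, the same arithmetic can be applied to all of the dominant events. The sketch below is purely illustrative; the totals and call counts are copied verbatim from the -log_view output quoted earlier, and nothing new is measured:

    #include <stdio.h>

    /* Average time per call = total time / call count, using the numbers
       reported by -log_view above. */
    int main(void)
    {
      const char  *event[]   = {"MatScale", "MatMatMultNum", "MatCopy", "MatZeroEntries"};
      const double seconds[] = {6.9348e+02, 4.0706e+02, 3.7437e+02, 3.1011e+02};
      const double calls[]   = {56162.0, 4161.0, 49928.0, 60363.0};

      for (int i = 0; i < 4; ++i)
        printf("%-15s %.4f s per call\n", event[i], seconds[i] / calls[i]);
      return 0;
    }

This reproduces the roughly 0.0123 s per MatScale call quoted above and gives about 0.098 s per MatMatMultNum call.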
