Yes, I call MatAXPY, but the matrix size stays the same.

Regards,

Roland
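For reference, a minimal sketch of the two in-place updates discussed below, assuming dense matrices of identical size (the helper name is hypothetical, not code from this thread). The note that a generic MatAYPX may internally scale Y with MatScale before adding X is an assumption worth checking against the PETSc sources, but it would explain MatScale showing up in -log_view without ever being called explicitly:

#include <petscmat.h>

/* Hypothetical helper: X and Y are dense matrices of the same size,
   so neither update changes the matrix dimensions. */
PetscErrorCode UpdateMatrices(Mat Y, Mat X, PetscScalar a)
{
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  /* Y <- a*X + Y */
  ierr = MatAXPY(Y, a, X, SAME_NONZERO_PATTERN); CHKERRQ(ierr);
  /* Y <- a*Y + X; a generic implementation may scale Y first (MatScale)
     and then add X (MatAXPY), which would make MatScale appear in
     -log_view even though it is never called directly (assumption). */
  ierr = MatAYPX(Y, a, X, SAME_NONZERO_PATTERN); CHKERRQ(ierr);
  PetscFunctionReturn(0);
}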
On 16.02.21 at 14:46, Stefano Zampini wrote:
>
> On Tue, 16 Feb 2021 at 16:30, Roland Richter <[email protected]> wrote:
>
>     For MatMatMult the size of the involved matrices is 8k x 8k and 8k x 32k.
>
> Ok, so you have 32k columns to multiply against. Maybe you can get some speedup.
> However, if you keep updating the matrix entries on the CPU, then using CUDA will make little sense.
> In any case, you can try and see if you get any speedup.
>
>     I am not sure where MatScale is called, I never call it explicitly. If MatDiagonalScale calls MatScale, then the involved matrices have a size of 8k x 32k.
>
> No, it does not. Are you calling MatAYPX?
>
>     Regards,
>
>     Roland
>
>     On 16.02.21 at 14:25, Stefano Zampini wrote:
>>
>>     the usual size of those matrices is (cumulative, not distributed) at least [8192x8192] x [8192x32768] complex entries as a lower boundary. Does it still make sense to test CUDA for a speedup?
>>
>> I don't understand your notation. Are you saying your matrices are 8K x 8K? Or 8K*32K? Or what?
>>
>>     Thank you,
>>
>>     regards,
>>
>>     Roland
>>
>>     On 16.02.21 at 14:14, Stefano Zampini wrote:
>>>
>>> On Tue, 16 Feb 2021 at 11:43, Roland Richter <[email protected]> wrote:
>>>
>>>     Hi,
>>>
>>>     after profiling my program using -log_view, I got the following output (all matrices are dense):
>>>
>>>     Using 8 OpenMP threads
>>>     Using Petsc Development GIT revision: v3.14.3-583-g5464005aea  GIT Date: 2021-01-25 16:01:41 -0600
>>>
>>>                                   Max       Max/Min     Avg       Total
>>>     Time (sec):           5.074e+03     1.000   5.074e+03
>>>     Objects:              2.158e+03     1.000   2.158e+03
>>>     Flop:                 5.236e+13     1.000   5.236e+13  5.236e+13
>>>     Flop/sec:             1.032e+10     1.000   1.032e+10  1.032e+10
>>>     MPI Messages:         0.000e+00     0.000   0.000e+00  0.000e+00
>>>     MPI Message Lengths:  0.000e+00     0.000   0.000e+00  0.000e+00
>>>     MPI Reductions:       0.000e+00     0.000
>>>
>>>     Flop counting convention: 1 flop = 1 real number operation of type (multiply/divide/add/subtract)
>>>                               e.g., VecAXPY() for real vectors of length N --> 2N flop
>>>                               and VecAXPY() for complex vectors of length N --> 8N flop
>>>
>>>     Summary of Stages:   ----- Time ------  ----- Flop ------  --- Messages ---  -- Message Lengths --  -- Reductions --
>>>                             Avg     %Total     Avg     %Total    Count   %Total     Avg         %Total    Count   %Total
>>>      0:      Main Stage: 5.0744e+03 100.0%  5.2359e+13 100.0%  0.000e+00   0.0%  0.000e+00        0.0%  0.000e+00   0.0%
>>>
>>>     ------------------------------------------------------------------------------------------------------------------------
>>>     See the 'Profiling' chapter of the users' manual for details on interpreting output.
>>>     Phase summary info:
>>>        Count: number of times phase was executed
>>>        Time and Flop: Max - maximum over all processors
>>>                       Ratio - ratio of maximum to minimum over all processors
>>>        Mess: number of messages sent
>>>        AvgLen: average message length (bytes)
>>>        Reduct: number of global reductions
>>>        Global: entire computation
>>>        Stage: stages of a computation. Set stages with PetscLogStagePush() and PetscLogStagePop().
>>>           %T - percent time in this phase         %F - percent flop in this phase
>>>           %M - percent messages in this phase     %L - percent message lengths in this phase
>>>           %R - percent reductions in this phase
>>>        Total Mflop/s: 10e-6 * (sum of flop over all processors)/(max time over all processors)
>>>        GPU Mflop/s: 10e-6 * (sum of flop on GPU over all processors)/(max GPU time over all processors)
>>>        CpuToGpu Count: total number of CPU to GPU copies per processor
>>>        CpuToGpu Size (Mbytes): 10e-6 * (total size of CPU to GPU copies per processor)
>>>        GpuToCpu Count: total number of GPU to CPU copies per processor
>>>        GpuToCpu Size (Mbytes): 10e-6 * (total size of GPU to CPU copies per processor)
>>>        GPU %F: percent flops on GPU in this event
>>>     ------------------------------------------------------------------------------------------------------------------------
>>>     Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total    GPU    - CpuToGpu -   - GpuToCpu - GPU
>>>                        Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
>>>     ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>
>>>     --- Event Stage 0: Main Stage
>>>
>>>     VecSet                37 1.0 1.0354e-04 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>     VecAssemblyBegin      31 1.0 2.9080e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>     VecAssemblyEnd        31 1.0 2.3270e-06 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>     MatCopy            49928 1.0 3.7437e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  7  0  0  0  0   7  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>     MatConvert          2080 1.0 5.8492e+00 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>     MatScale           56162 1.0 6.9348e+02 1.0 1.60e+12 1.0 0.0e+00 0.0e+00 0.0e+00 14  3  0  0  0  14  3  0  0  0  2303       0      0 0.00e+00    0 0.00e+00  0
>>>     MatAssemblyBegin   56222 1.0 1.7370e-02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>     MatAssemblyEnd     56222 1.0 8.8713e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>     MatZeroEntries     60363 1.0 3.1011e+02 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  6  0  0  0  0   6  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>     MatAXPY             8320 1.0 1.2254e+02 1.0 5.58e+11 1.0 0.0e+00 0.0e+00 0.0e+00  2  1  0  0  0   2  1  0  0  0  4557       0      0 0.00e+00    0 0.00e+00  0
>>>     MatMatMultSym       4161 1.0 7.1613e-03 1.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
>>>     MatMatMultNum       4161 1.0 4.0706e+02 1.0 5.02e+13 1.0 0.0e+00 0.0e+00 0.0e+00  8 96  0  0  0   8 96  0  0  0 123331       0      0 0.00e+00    0 0.00e+00  0
>>>     ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>>>
>>>     Memory usage is given in bytes:
>>>
>>>     Object Type          Creations   Destructions     Memory  Descendants' Mem.
>>>     Reports information only for process 0.
>>>
>>>     --- Event Stage 0: Main Stage
>>>
>>>                   Vector    37             34         1634064     0.
>>>                   Matrix  2120           2120     52734663456     0.
>>>                   Viewer     1              0               0     0.
>>>     ========================================================================================================================
>>>
>>>     Apparently, MatMatMultNum and MatScale take the most time (by far) during execution. Therefore, I was wondering whether it is possible to move those operations/all matrices and vectors to a GPU or another accelerator. According to https://www.mcs.anl.gov/petsc/features/gpus.html, CUDA is only supported for distributed vectors, but not for dense distributed matrices. Are there any updates related to that, or other ways to speed up the involved operations?
>>>
>>> You should compute the timings associated with each call, and not consider the lump sum. For example, each MatScale takes 6.9348e+02/56162 = 0.012347851 seconds on average; I doubt you can get any reasonable speedup with CUDA. What are the sizes of these matrices?
>>>
>>>     Thanks!
>>>
>>>     Regards,
>>>
>>>     Roland
>>>
>>> --
>>> Stefano
>>
>> --
>> Stefano
>
> --
> Stefano
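Following the suggestion above, the per-call averages derived from the -log_view numbers in the quoted output (total time divided by call count) are roughly:

MatMatMultNum:  4.0706e+02 s / 4161 calls  ~ 0.098  s per call
MatScale:       6.9348e+02 s / 56162 calls ~ 0.0123 s per call
MatAXPY:        1.2254e+02 s / 8320 calls  ~ 0.0147 s per call
MatCopy:        3.7437e+02 s / 49928 calls ~ 0.0075 s per call
MatZeroEntries: 3.1011e+02 s / 60363 calls ~ 0.0051 s per call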

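To try Stefano's suggestion of testing for a GPU speedup, a minimal sketch along the following lines could be timed. It assumes a CUDA-enabled PETSc build in which the dense CUDA matrix type (MATDENSECUDA) supports MatMatMult and MatScale; the sizes follow the 8192 x 8192 and 8192 x 32768 matrices mentioned above, and the matrices are left at their zero values since only the kernels are of interest here:

#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat            A, B, C;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;

  /* A: 8192 x 8192, B: 8192 x 32768, dense by default; the type can be
     overridden at run time, e.g. with -mat_type densecuda (assumes a
     CUDA-enabled build providing MATDENSECUDA) */
  ierr = MatCreate(PETSC_COMM_WORLD, &A); CHKERRQ(ierr);
  ierr = MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, 8192, 8192); CHKERRQ(ierr);
  ierr = MatSetType(A, MATDENSE); CHKERRQ(ierr);
  ierr = MatSetFromOptions(A); CHKERRQ(ierr);
  ierr = MatSetUp(A); CHKERRQ(ierr);
  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);

  ierr = MatCreate(PETSC_COMM_WORLD, &B); CHKERRQ(ierr);
  ierr = MatSetSizes(B, PETSC_DECIDE, PETSC_DECIDE, 8192, 32768); CHKERRQ(ierr);
  ierr = MatSetType(B, MATDENSE); CHKERRQ(ierr);
  ierr = MatSetFromOptions(B); CHKERRQ(ierr);
  ierr = MatSetUp(B); CHKERRQ(ierr);
  ierr = MatAssemblyBegin(B, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
  ierr = MatAssemblyEnd(B, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);

  /* C = A*B followed by a scaling, the two dominant operations in the log */
  ierr = MatMatMult(A, B, MAT_INITIAL_MATRIX, PETSC_DEFAULT, &C); CHKERRQ(ierr);
  ierr = MatScale(C, 2.0); CHKERRQ(ierr);

  ierr = MatDestroy(&C); CHKERRQ(ierr);
  ierr = MatDestroy(&B); CHKERRQ(ierr);
  ierr = MatDestroy(&A); CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}

Running this once with -log_view and once with -mat_type densecuda -log_view should show whether MatMatMultNum and MatScale benefit; as noted above, if the entries keep being updated on the CPU between products, the CPU-to-GPU copies may cancel any gain.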