For an AIJ matrix with 32-bit integer indices (and double-precision values), this is 1 flop per 6 bytes, which puts the best case below (42 ranks) at about 165 GB/s for the node.
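As a back-of-the-envelope check (a standalone sketch, not PETSc code; it assumes the 6 bytes/flop above, i.e., an 8-byte value plus a 4-byte column index per 2-flop nonzero, and ignores row pointers and vector traffic), the MatMult rates from the CPU logs below convert to effective bandwidth like this:

#include <stdio.h>

int main(void)
{
  /* MatMult rates (Mflop/s) taken from the 24-, 36-, and 42-rank CPU logs below */
  const double mflops[]       = {17948.0, 25145.0, 27493.0};
  const int    ranks[]        = {24, 36, 42};
  const double bytes_per_flop = 6.0; /* (8 B value + 4 B int32 index) / 2 flops per nonzero */
  for (int i = 0; i < 3; i++) {
    double gbps = mflops[i] * bytes_per_flop / 1000.0; /* Mbyte/s -> GB/s */
    printf("%2d ranks: %6.1f GB/s effective bandwidth\n", ranks[i], gbps);
  }
  return 0;
}

That works out to roughly 108, 151, and 165 GB/s for 24, 36, and 42 ranks.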
My understanding is that these systems have 8 channels of DDR4-2666 per socket, which is ~340 GB/s of theoretical bandwidth on a 2-socket system, and 270 GB/s STREAM Triad according to this post:
https://openpowerblog.wordpress.com/2018/07/19/epyc-skylake-vs-power9-stream-memory-bandwidth-comparison-via-zaius-barreleye-g2/

Is this 60% of Triad the best we can get for SpMV?

"Zhang, Junchao via petsc-dev" <petsc-dev@mcs.anl.gov> writes:

> 42 cores have better performance.
>
> 36 MPI ranks
> MatMult              100 1.0 2.2435e+00 1.0 1.75e+09 1.3 2.9e+04 4.5e+04 0.0e+00  6 99 97 28  0 100100100100  0 25145       0      0 0.00e+00    0 0.00e+00  0
> VecScatterBegin      100 1.0 2.1869e-02 3.3 0.00e+00 0.0 2.9e+04 4.5e+04 0.0e+00  0  0 97 28  0   1  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0
> VecScatterEnd        100 1.0 7.9205e-01 52.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  22  0  0  0  0    0       0      0 0.00e+00    0 0.00e+00  0
>
> --Junchao Zhang
>
>
> On Sat, Sep 21, 2019 at 9:41 PM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
>
> Junchao,
>
> Mark has a good point; could you also try for completeness the CPU with 36 cores and see if it is any better than the 42 core case?
>
> Barry
>
> So extrapolating about 20 nodes of the CPUs is equivalent to 1 node of the GPUs for the multiply for this problem size.
>
>> On Sep 21, 2019, at 6:40 PM, Mark Adams <mfad...@lbl.gov> wrote:
>>
>> I came up with 36 cores/node for CPU GAMG runs. The memory bus is pretty saturated at that point.
>>
>> On Sat, Sep 21, 2019 at 1:44 AM Zhang, Junchao via petsc-dev <petsc-dev@mcs.anl.gov> wrote:
>> Here are CPU version results on one node with 24 cores, 42 cores. Click the links for core layout.
>>
>> 24 MPI ranks, https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c4g1r14d1b21l0=
>> MatMult              100 1.0 3.1431e+00 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  8 99 97 25  0 100100100100  0 17948       0      0 0.00e+00    0 0.00e+00  0
>> VecScatterBegin      100 1.0 2.0583e-02 2.3 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  0  0 97 25  0   0  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0
>> VecScatterEnd        100 1.0 1.0639e+00 50.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0  19  0  0  0  0    0       0      0 0.00e+00    0 0.00e+00  0
>>
>> 42 MPI ranks, https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c7g1r17d1b21l0=
>> MatMult              100 1.0 2.0519e+00 1.0 1.52e+09 1.3 3.5e+04 4.1e+04 0.0e+00 23 99 97 30  0 100100100100  0 27493       0      0 0.00e+00    0 0.00e+00  0
>> VecScatterBegin      100 1.0 2.0971e-02 3.4 0.00e+00 0.0 3.5e+04 4.1e+04 0.0e+00  0  0 97 30  0   1  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0
>> VecScatterEnd        100 1.0 8.5184e-01 62.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  6  0  0  0  0  24  0  0  0  0    0       0      0 0.00e+00    0 0.00e+00  0
>>
>> --Junchao Zhang
>>
>>
>> On Fri, Sep 20, 2019 at 11:48 PM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
>>
>> Junchao,
>>
>> Very interesting. For completeness please run also 24 and 42 CPUs without the GPUs. Note that the default layout for CPU cores is not good. You will want 3 cores on each socket then 12 on each.
>>
>> Thanks
>>
>> Barry
>>
>> Since Tim is one of our reviewers next week this is a very good test matrix :-)
>>
>>
>> > On Sep 20, 2019, at 11:39 PM, Zhang, Junchao via petsc-dev <petsc-dev@mcs.anl.gov> wrote:
>> >
>> > Click the links to visualize it.
>> >
>> > 6 ranks
>> > https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c1g1r11d1b21l0=
>> > jsrun -n 6 -a 1 -c 1 -g 1 -r 6 --latency_priority GPU-GPU --launch_distribution packed --bind packed:1 js_task_info ./ex900 -f HV15R.aij -mat_type aijcusparse -vec_type cuda -n 100 -log_view
>> >
>> > 24 ranks
>> > https://jsrunvisualizer.olcf.ornl.gov/?s4f1o01n6c4g1r14d1b21l0=
>> > jsrun -n 6 -a 4 -c 4 -g 1 -r 6 --latency_priority GPU-GPU --launch_distribution packed --bind packed:1 js_task_info ./ex900 -f HV15R.aij -mat_type aijcusparse -vec_type cuda -n 100 -log_view
>> >
>> > --Junchao Zhang
>> >
>> >
>> > On Fri, Sep 20, 2019 at 11:34 PM Mills, Richard Tran via petsc-dev <petsc-dev@mcs.anl.gov> wrote:
>> > Junchao,
>> >
>> > Can you share your 'jsrun' command so that we can see how you are mapping things to resource sets?
>> >
>> > --Richard
>> >
>> > On 9/20/19 11:22 PM, Zhang, Junchao via petsc-dev wrote:
>> >> I downloaded a sparse matrix (HV15R) from Florida Sparse Matrix Collection. Its size is about 2M x 2M. Then I ran the same MatMult 100 times on one node of Summit with -mat_type aijcusparse -vec_type cuda. I found MatMult was almost dominated by VecScatter in this simple test. Using 6 MPI ranks + 6 GPUs, I found CUDA aware SF could improve performance. But if I enabled Multi-Process Service on Summit and used 24 ranks + 6 GPUs, I found CUDA aware SF hurt performance. I don't know why and have to profile it. I will also collect data with multiple nodes. Are the matrix and tests proper?
>> >>
>> >> ------------------------------------------------------------------------------------------------------------------------
>> >> Event                Count      Time (sec)     Flop                             --- Global ---  --- Stage ----  Total    GPU    - CpuToGpu -   - GpuToCpu - GPU
>> >>                      Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
>> >> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>> >> 6 MPI ranks (CPU version)
>> >> MatMult              100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05 0.0e+00 24 99 97 18  0 100100100100  0   4743       0      0 0.00e+00    0 0.00e+00  0
>> >> VecScatterBegin      100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0   0  0100100  0      0       0      0 0.00e+00    0 0.00e+00  0
>> >> VecScatterEnd        100 1.0 2.9441e+00 133 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0  13  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
>> >>
>> >> 6 MPI ranks + 6 GPUs + regular SF
>> >> MatMult              100 1.0 1.7800e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00  0 99 97 18  0 100100100100  0 318057 3084009    100 1.02e+02  100 2.69e+02 100
>> >> VecScatterBegin      100 1.0 1.2786e-01 1.3 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0  64  0100100  0      0       0      0 0.00e+00  100 2.69e+02  0
>> >> VecScatterEnd        100 1.0 6.2196e-02 3.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  22  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
>> >> VecCUDACopyTo        100 1.0 1.0850e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   5  0  0  0  0      0       0    100 1.02e+02    0 0.00e+00  0
>> >> VecCopyFromSome      100 1.0 1.0263e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  54  0  0  0  0      0       0      0 0.00e+00  100 2.69e+02  0
>> >>
>> >> 6 MPI ranks + 6 GPUs + CUDA-aware SF
>> >> MatMult              100 1.0 1.1112e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00  1 99 97 18  0 100100100100  0 509496 3133521      0 0.00e+00    0 0.00e+00 100
>> >> VecScatterBegin      100 1.0 7.9461e-02 1.1 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  1  0 97 18  0  70  0100100  0      0       0      0 0.00e+00    0 0.00e+00  0
>> >> VecScatterEnd        100 1.0 2.2805e-02 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  17  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
>> >>
>> >> 24 MPI ranks + 6 GPUs + regular SF
>> >> MatMult              100 1.0 1.1094e-01 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  1 99 97 25  0 100100100100  0 510337  951558    100 4.61e+01  100 6.72e+01 100
>> >> VecScatterBegin      100 1.0 4.8966e-02 1.8 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  0  0 97 25  0  34  0100100  0      0       0      0 0.00e+00  100 6.72e+01  0
>> >> VecScatterEnd        100 1.0 7.2969e-02 4.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  42  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
>> >> VecCUDACopyTo        100 1.0 4.4487e-03 1.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   3  0  0  0  0      0       0    100 4.61e+01    0 0.00e+00  0
>> >> VecCopyFromSome      100 1.0 4.3315e-02 1.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  29  0  0  0  0      0       0      0 0.00e+00  100 6.72e+01  0
>> >>
>> >> 24 MPI ranks + 6 GPUs + CUDA-aware SF
>> >> MatMult              100 1.0 1.4597e-01 1.2 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  1 99 97 25  0 100100100100  0 387864  973391      0 0.00e+00    0 0.00e+00 100
>> >> VecScatterBegin      100 1.0 6.4899e-02 2.9 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  1  0 97 25  0  35  0100100  0      0       0      0 0.00e+00    0 0.00e+00  0
>> >> VecScatterEnd        100 1.0 1.1179e-01 4.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  48  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
>> >>
>> >>
>> >> --Junchao Zhang
>> >
>>