Hi Junchao,
thanks, these numbers are interesting.
Do you have an easy way to compare a CUDA-aware MPI against a
non-CUDA-aware MPI that still uses your packing/unpacking routines?
I'd like to get a feeling for where the performance gains come from. Is
it the reduced PCI-Express transfer for the scatters (i.e.
packing/unpacking and transferring only the relevant entries) on each
rank, or is it some low-level optimization that makes the MPI part of
the communication faster? Your current MR includes both; it would be
helpful to know whether we can extract similar benefits for other GPU
backends without requiring "CUDA-awareness" of MPI. If the gains are
mostly due to the packing/unpacking, we could carry them over to other
GPU backends (e.g. the upcoming Intel GPUs) without having to wait for
an "Intel-GPU-aware MPI".
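
To make the distinction concrete, here is a rough sketch (plain C, not
the actual SF code) of the two send paths I have in mind. PackOnDevice()
is just a placeholder for a device pack kernel, and error checking is
omitted:

#include <mpi.h>
#include <cuda_runtime.h>

/* Placeholder for a CUDA pack kernel wrapper: gathers d_vec[d_idx[0..n-1]]
   into the contiguous device buffer d_buf. Not an existing PETSc routine. */
extern void PackOnDevice(const double *d_vec, const int *d_idx, int n, double *d_buf);

void ScatterSend(const double *d_vec, const int *d_idx, int n,
                 double *d_packbuf, double *h_packbuf,
                 int dest, MPI_Comm comm, int cuda_aware_mpi)
{
  /* The pack step is identical in both paths, so only the relevant
     entries ever leave the GPU. */
  PackOnDevice(d_vec, d_idx, n, d_packbuf);
  cudaDeviceSynchronize();

  if (cuda_aware_mpi) {
    /* CUDA-aware MPI: pass the device pointer to MPI directly; the MPI
       library moves the data (GPUDirect RDMA where available). */
    MPI_Send(d_packbuf, n, MPI_DOUBLE, dest, 0, comm);
  } else {
    /* Plain MPI: stage the packed buffer through the host first. The
       extra cost is one PCIe copy of just the n selected entries. */
    cudaMemcpy(h_packbuf, d_packbuf, n*sizeof(double), cudaMemcpyDeviceToHost);
    MPI_Send(h_packbuf, n, MPI_DOUBLE, dest, 0, comm);
  }
}

The receive side would mirror this with an unpack kernel. The packed
payload is identical in both cases, so any difference should come from
how that buffer crosses PCI-Express and from what the MPI library can
do with a device pointer.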
Best regards,
Karli
On 9/21/19 6:22 AM, Zhang, Junchao via petsc-dev wrote:
I downloaded a sparse matrix (HV15R
<https://sparse.tamu.edu/Fluorem/HV15R>) from the Florida Sparse Matrix
Collection. Its size is about 2M x 2M. I then ran the same MatMult 100
times on one node of Summit with -mat_type aijcusparse -vec_type cuda.
In this simple test, MatMult was almost entirely dominated by VecScatter.
With 6 MPI ranks + 6 GPUs, the CUDA-aware SF improved performance. But
when I enabled the Multi-Process Service on Summit and used 24 ranks +
6 GPUs, the CUDA-aware SF hurt performance. I don't know why yet and
will have to profile it. I will also collect data on multiple nodes.
Are the matrix and the tests appropriate?
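
For reference, a stripped-down driver along the lines of what I ran
(the -f option and the fixed loop count are only for this sketch; the
matrix first has to be converted to PETSc binary format):

static char help[] = "Times 100 MatMult calls on a matrix loaded from a binary file.\n";

#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat            A;
  Vec            x, y;
  PetscViewer    viewer;
  char           file[PETSC_MAX_PATH_LEN];
  PetscBool      flg;
  PetscInt       i;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, help);if (ierr) return ierr;
  ierr = PetscOptionsGetString(NULL, NULL, "-f", file, sizeof(file), &flg);CHKERRQ(ierr);
  if (!flg) SETERRQ(PETSC_COMM_WORLD, PETSC_ERR_USER, "Provide the matrix file with -f <file>");

  /* Load the matrix; -mat_type aijcusparse on the command line selects the GPU format */
  ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD, file, FILE_MODE_READ, &viewer);CHKERRQ(ierr);
  ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
  ierr = MatSetFromOptions(A);CHKERRQ(ierr);
  ierr = MatLoad(A, viewer);CHKERRQ(ierr);
  ierr = PetscViewerDestroy(&viewer);CHKERRQ(ierr);

  /* Work vectors compatible with the matrix type */
  ierr = MatCreateVecs(A, &x, &y);CHKERRQ(ierr);
  ierr = VecSet(x, 1.0);CHKERRQ(ierr);

  for (i = 0; i < 100; i++) {
    ierr = MatMult(A, x, y);CHKERRQ(ierr);  /* timed via -log_view (MatMult, VecScatter*) */
  }

  ierr = VecDestroy(&x);CHKERRQ(ierr);
  ierr = VecDestroy(&y);CHKERRQ(ierr);
  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}

I ran it with 6 or 24 ranks per node and -mat_type aijcusparse
-vec_type cuda -log_view; the relevant -log_view events are below.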
------------------------------------------------------------------------------------------------------------------------
Event              Count      Time (sec)     Flop                              --- Global ---  --- Stage ----   Total     GPU   - CpuToGpu -   - GpuToCpu -  GPU
                   Max Ratio  Max      Ratio Max      Ratio Mess    AvgLen  Reduct  %T %F %M %L %R  %T  %F  %M  %L %R Mflop/s Mflop/s Count Size     Count Size     %F
---------------------------------------------------------------------------------------------------------------------------------------------------------------

6 MPI ranks (CPU version)
MatMult            100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05 0.0e+00 24 99 97 18  0 100 100 100 100  0   4743       0     0 0.00e+00     0 0.00e+00   0
VecScatterBegin    100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0   0   0 100 100  0      0       0     0 0.00e+00     0 0.00e+00   0
VecScatterEnd      100 1.0 2.9441e+00 133 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0  13   0   0   0  0      0       0     0 0.00e+00     0 0.00e+00   0

6 MPI ranks + 6 GPUs + regular SF
MatMult            100 1.0 1.7800e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00  0 99 97 18  0 100 100 100 100  0 318057 3084009   100 1.02e+02   100 2.69e+02 100
VecScatterBegin    100 1.0 1.2786e-01 1.3 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0  64   0 100 100  0      0       0     0 0.00e+00   100 2.69e+02   0
VecScatterEnd      100 1.0 6.2196e-02 3.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  22   0   0   0  0      0       0     0 0.00e+00     0 0.00e+00   0
VecCUDACopyTo      100 1.0 1.0850e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   5   0   0   0  0      0       0   100 1.02e+02     0 0.00e+00   0
VecCopyFromSome    100 1.0 1.0263e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  54   0   0   0  0      0       0     0 0.00e+00   100 2.69e+02   0

6 MPI ranks + 6 GPUs + CUDA-aware SF
MatMult            100 1.0 1.1112e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00  1 99 97 18  0 100 100 100 100  0 509496 3133521     0 0.00e+00     0 0.00e+00 100
VecScatterBegin    100 1.0 7.9461e-02 1.1 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  1  0 97 18  0  70   0 100 100  0      0       0     0 0.00e+00     0 0.00e+00   0
VecScatterEnd      100 1.0 2.2805e-02 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  17   0   0   0  0      0       0     0 0.00e+00     0 0.00e+00   0

24 MPI ranks + 6 GPUs + regular SF
MatMult            100 1.0 1.1094e-01 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  1 99 97 25  0 100 100 100 100  0 510337  951558   100 4.61e+01   100 6.72e+01 100
VecScatterBegin    100 1.0 4.8966e-02 1.8 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  0  0 97 25  0  34   0 100 100  0      0       0     0 0.00e+00   100 6.72e+01   0
VecScatterEnd      100 1.0 7.2969e-02 4.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  42   0   0   0  0      0       0     0 0.00e+00     0 0.00e+00   0
VecCUDACopyTo      100 1.0 4.4487e-03 1.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   3   0   0   0  0      0       0   100 4.61e+01     0 0.00e+00   0
VecCopyFromSome    100 1.0 4.3315e-02 1.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  29   0   0   0  0      0       0     0 0.00e+00   100 6.72e+01   0

24 MPI ranks + 6 GPUs + CUDA-aware SF
MatMult            100 1.0 1.4597e-01 1.2 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  1 99 97 25  0 100 100 100 100  0 387864  973391     0 0.00e+00     0 0.00e+00 100
VecScatterBegin    100 1.0 6.4899e-02 2.9 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  1  0 97 25  0  35   0 100 100  0      0       0     0 0.00e+00     0 0.00e+00   0
VecScatterEnd      100 1.0 1.1179e-01 4.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  48   0   0   0  0      0       0     0 0.00e+00     0 0.00e+00   0
--Junchao Zhang