Sorry, forgot
Could you please put the GPU wait call before each of the log ends in that routine and see what kind of new numbers you get?

Thanks,

Barry

> On Sep 21, 2019, at 11:00 AM, Zhang, Junchao <jczh...@mcs.anl.gov> wrote:
>
> We log GPU time before/after the cusparse calls:
> https://gitlab.com/petsc/petsc/blob/master/src%2Fmat%2Fimpls%2Faij%2Fseq%2Fseqcusparse%2Faijcusparse.cu#L1441
> But according to https://docs.nvidia.com/cuda/cusparse/index.html#asynchronous-execution, cusparse is asynchronous. Does that mean the GPU time is meaningless?
>
> --Junchao Zhang
>
> On Sat, Sep 21, 2019 at 8:30 AM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
> >
> > Hannah, Junchao and Richard,
> >
> > The on-GPU flop rates for 24 MPI ranks and 24 MPS GPUs look totally funky: 951558 and 973391 are so much lower than the unvirtualized 3084009 and 3133521, and yet the total time to solution is similar for the runs.
> >
> > Is it possible these are being counted or calculated wrong? If not, what does this mean? Please check the code that computes them (I can't imagine it is wrong, but ...).
> >
> > It means the GPUs are taking 3.x times longer to do the multiplies in the MPS case, but where is that time coming from in the other numbers? Communication time doesn't drop that much.
> >
> > I can't present these numbers with this huge inconsistency.
> >
> > Thanks,
> >
> > Barry
> >
> > On Sep 20, 2019, at 11:22 PM, Zhang, Junchao via petsc-dev <petsc-dev@mcs.anl.gov> wrote:
> >
> > I downloaded a sparse matrix (HV15R) from the Florida Sparse Matrix Collection. Its size is about 2M x 2M. Then I ran the same MatMult 100 times on one node of Summit with -mat_type aijcusparse -vec_type cuda. I found MatMult was almost dominated by VecScatter in this simple test. Using 6 MPI ranks + 6 GPUs, I found CUDA-aware SF could improve performance. But if I enabled the Multi-Process Service (MPS) on Summit and used 24 ranks + 6 GPUs, I found CUDA-aware SF hurt performance. I don't know why and have to profile it. I will also collect data on multiple nodes. Are the matrix and tests appropriate?
> >
> > ------------------------------------------------------------------------------------------------------------------------
> > Event            Count    Time (sec)     Flop                               --- Global ---  --- Stage ----   Total    GPU    - CpuToGpu -   - GpuToCpu -  GPU
> >                  Max Ratio Max      Ratio Max      Ratio Mess    AvgLen  Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s Mflop/s Count Size    Count Size    %F
> > ---------------------------------------------------------------------------------------------------------------------------------------------------------------
> >
> > 6 MPI ranks (CPU version)
> > MatMult          100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05 0.0e+00 24 99 97 18 0 100 100 100 100 0   4743       0   0 0.00e+00   0 0.00e+00   0
> > VecScatterBegin  100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18 0   0   0 100 100 0      0       0   0 0.00e+00   0 0.00e+00   0
> > VecScatterEnd    100 1.0 2.9441e+00 133 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 3  0  0  0  0  13   0   0   0 0      0       0   0 0.00e+00   0 0.00e+00   0
> >
> > 6 MPI ranks + 6 GPUs + regular SF
> > MatMult          100 1.0 1.7800e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00  0 99 97 18 0 100 100 100 100 0 318057 3084009 100 1.02e+02 100 2.69e+02 100
> > VecScatterBegin  100 1.0 1.2786e-01 1.3 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18 0  64   0 100 100 0      0       0   0 0.00e+00 100 2.69e+02   0
> > VecScatterEnd    100 1.0 6.2196e-02 3.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0 0  22   0   0   0 0      0       0   0 0.00e+00   0 0.00e+00   0
> > VecCUDACopyTo    100 1.0 1.0850e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0 0   5   0   0   0 0      0       0 100 1.02e+02   0 0.00e+00   0
> > VecCopyFromSome  100 1.0 1.0263e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0 0  54   0   0   0 0      0       0   0 0.00e+00 100 2.69e+02   0
> >
> > 6 MPI ranks + 6 GPUs + CUDA-aware SF
> > MatMult          100 1.0 1.1112e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00  1 99 97 18 0 100 100 100 100 0 509496 3133521   0 0.00e+00   0 0.00e+00 100
> > VecScatterBegin  100 1.0 7.9461e-02 1.1 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  1  0 97 18 0  70   0 100 100 0      0       0   0 0.00e+00   0 0.00e+00   0
> > VecScatterEnd    100 1.0 2.2805e-02 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0 0  17   0   0   0 0      0       0   0 0.00e+00   0 0.00e+00   0
> >
> > 24 MPI ranks + 6 GPUs + regular SF
> > MatMult          100 1.0 1.1094e-01 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  1 99 97 25 0 100 100 100 100 0 510337  951558 100 4.61e+01 100 6.72e+01 100
> > VecScatterBegin  100 1.0 4.8966e-02 1.8 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  0  0 97 25 0  34   0 100 100 0      0       0   0 0.00e+00 100 6.72e+01   0
> > VecScatterEnd    100 1.0 7.2969e-02 4.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0 0  42   0   0   0 0      0       0   0 0.00e+00   0 0.00e+00   0
> > VecCUDACopyTo    100 1.0 4.4487e-03 1.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0 0   3   0   0   0 0      0       0 100 4.61e+01   0 0.00e+00   0
> > VecCopyFromSome  100 1.0 4.3315e-02 1.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0 0  29   0   0   0 0      0       0   0 0.00e+00 100 6.72e+01   0
> >
> > 24 MPI ranks + 6 GPUs + CUDA-aware SF
> > MatMult          100 1.0 1.4597e-01 1.2 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  1 99 97 25 0 100 100 100 100 0 387864  973391   0 0.00e+00   0 0.00e+00 100
> > VecScatterBegin  100 1.0 6.4899e-02 2.9 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  1  0 97 25 0  35   0 100 100 0      0       0   0 0.00e+00   0 0.00e+00   0
> > VecScatterEnd    100 1.0 1.1179e-01 4.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0 0  48   0   0   0 0      0       0   0 0.00e+00   0 0.00e+00   0
> >
> > --Junchao Zhang
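
For concreteness, a minimal sketch of the change Barry is asking for: put a device-wide wait before the GPU-time log end that surrounds the asynchronous cusparse work, so the logged interval covers kernels that have actually finished. The PetscLogGpuTimeBegin/End names follow the timing calls mentioned in the thread, and a plain cudaDeviceSynchronize() stands in for whatever wait macro PETSc provides; this is an illustration, not the actual routine in aijcusparse.cu.

  #include <petscsys.h>
  #include <cuda_runtime.h>

  /* Illustration only, not the real code in aijcusparse.cu (requires a CUDA-enabled PETSc build). */
  static PetscErrorCode TimedAsyncGpuWork(void)
  {
    PetscErrorCode ierr;
    cudaError_t    cerr;

    PetscFunctionBegin;
    ierr = PetscLogGpuTimeBegin();CHKERRQ(ierr);
    /* ... asynchronous cusparse SpMV (and any CUDA kernels) are enqueued here ... */
    cerr = cudaDeviceSynchronize();CHKERRCUDA(cerr); /* the "GPU wait call": block until the device is done */
    ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);       /* the logged GPU time now reflects completed work */
    PetscFunctionReturn(0);
  }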
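
On the asynchronous-execution question: a host-side timer stopped immediately after an asynchronous cusparse call mostly measures launch overhead, so the time (and hence the reported GPU Mflop/s) is not meaningful unless something forces completion first. A generic CUDA sketch, independent of PETSc, of timing such work with CUDA events recorded on the same stream; the helper name and the surrounding setup are illustrative assumptions.

  #include <cuda_runtime.h>

  /* Generic illustration: measure work that was enqueued asynchronously on `stream`. */
  static float TimeAsyncWorkMs(cudaStream_t stream)
  {
    cudaEvent_t start, stop;
    float       ms = 0.0f;

    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, stream);
    /* ... enqueue the asynchronous cusparse SpMV (or any kernels) on `stream` here ... */
    cudaEventRecord(stop, stream);
    cudaEventSynchronize(stop);             /* wait until the enqueued work has actually finished */
    cudaEventElapsedTime(&ms, start, stop); /* elapsed device time between the two events, in ms */
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
  }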
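
Finally, a rough sketch of the kind of driver behind the MatMult test described in the Sep 20 message (not Junchao's actual code): load the matrix from a PETSc binary file and apply it 100 times, letting -mat_type aijcusparse select the GPU matrix type (MatCreateVecs then returns CUDA vectors for that type). The file name hv15r.petsc and the prior conversion of HV15R from Matrix Market to PETSc binary format are assumptions.

  #include <petscmat.h>

  /* Rough sketch of the MatMult benchmark described above. */
  int main(int argc, char **argv)
  {
    Mat            A;
    Vec            x, y;
    PetscViewer    viewer;
    PetscErrorCode ierr;
    PetscInt       i;

    ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
    ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD, "hv15r.petsc", FILE_MODE_READ, &viewer);CHKERRQ(ierr);
    ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
    ierr = MatSetFromOptions(A);CHKERRQ(ierr);     /* honors -mat_type aijcusparse */
    ierr = MatLoad(A, viewer);CHKERRQ(ierr);
    ierr = PetscViewerDestroy(&viewer);CHKERRQ(ierr);
    ierr = MatCreateVecs(A, &x, &y);CHKERRQ(ierr); /* vectors of the matrix's native type */
    ierr = VecSet(x, 1.0);CHKERRQ(ierr);
    for (i = 0; i < 100; i++) {                    /* the 100 MatMults being timed */
      ierr = MatMult(A, x, y);CHKERRQ(ierr);
    }
    ierr = VecDestroy(&x);CHKERRQ(ierr);
    ierr = VecDestroy(&y);CHKERRQ(ierr);
    ierr = MatDestroy(&A);CHKERRQ(ierr);
    ierr = PetscFinalize();
    return ierr;
  }

A run such as mpiexec -n 6 ./matmult_bench -mat_type aijcusparse -vec_type cuda -log_view (the executable name is hypothetical; on Summit the launcher would be jsrun, and enabling MPS is a separate site-specific step) produces event lines like the MatMult/VecScatter rows in the table above.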