Sorry, forgot
Could you please put the GPU wait call before each of the log ends in that routine and see what kind of new numbers you get?

Thanks,

Barry

> On Sep 21, 2019, at 11:00 AM, Zhang, Junchao <jczh...@mcs.anl.gov> wrote:
>
> We log GPU time before/after the cusparse calls:
> https://gitlab.com/petsc/petsc/blob/master/src%2Fmat%2Fimpls%2Faij%2Fseq%2Fseqcusparse%2Faijcusparse.cu#L1441
> But according to https://docs.nvidia.com/cuda/cusparse/index.html#asynchronous-execution, cusparse is asynchronous. Does that mean the GPU time is meaningless?
>
> --Junchao Zhang
>
> On Sat, Sep 21, 2019 at 8:30 AM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
> >
> > Hannah, Junchao and Richard,
> >
> > The on-GPU flop rates for 24 MPI ranks and 24 MPS GPUs look totally funky: 951558 and 973391 are so much lower than the unvirtualized 3084009 and 3133521, and yet the total time to solution is similar for the runs.
> >
> > Is it possible these are being counted or calculated wrong? If not, what does this mean? Please check the code that computes them (I can't imagine it is wrong, but ...).
> >
> > It means the GPUs are taking 3.x times longer to do the multiplies in the MPS case, but where is that time coming from in the other numbers? Communication time doesn't drop that much.
> >
> > I can't present these numbers with this huge inconsistency.
> >
> > Thanks,
> >
> > Barry
> >
> > On Sep 20, 2019, at 11:22 PM, Zhang, Junchao via petsc-dev <petsc-dev@mcs.anl.gov> wrote:
> >
> > I downloaded a sparse matrix (HV15R) from the Florida Sparse Matrix Collection. Its size is about 2M x 2M. Then I ran the same MatMult 100 times on one node of Summit with -mat_type aijcusparse -vec_type cuda. I found MatMult was almost dominated by VecScatter in this simple test. Using 6 MPI ranks + 6 GPUs, I found CUDA-aware SF could improve performance. But if I enabled the Multi-Process Service (MPS) on Summit and used 24 ranks + 6 GPUs, I found CUDA-aware SF hurt performance. I don't know why and have to profile it. I will also collect data on multiple nodes. Are the matrix and tests appropriate?
> >
> > ------------------------------------------------------------------------------------------------------------------------
> > Event            Count    Time (sec)     Flop                               --- Global ---  --- Stage ----   Total    GPU    - CpuToGpu -   - GpuToCpu -  GPU
> >                  Max Ratio Max      Ratio Max      Ratio Mess    AvgLen  Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s Mflop/s Count Size    Count Size    %F
> > ---------------------------------------------------------------------------------------------------------------------------------------------------------------
> >
> > 6 MPI ranks (CPU version)
> > MatMult          100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05 0.0e+00 24 99 97 18 0 100 100 100 100 0   4743       0   0 0.00e+00   0 0.00e+00   0
> > VecScatterBegin  100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18 0   0   0 100 100 0      0       0   0 0.00e+00   0 0.00e+00   0
> > VecScatterEnd    100 1.0 2.9441e+00 133 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 3  0  0  0  0  13   0   0   0 0      0       0   0 0.00e+00   0 0.00e+00   0
> >
> > 6 MPI ranks + 6 GPUs + regular SF
> > MatMult          100 1.0 1.7800e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00  0 99 97 18 0 100 100 100 100 0 318057 3084009 100 1.02e+02 100 2.69e+02 100
> > VecScatterBegin  100 1.0 1.2786e-01 1.3 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18 0  64   0 100 100 0      0       0   0 0.00e+00 100 2.69e+02   0
> > VecScatterEnd    100 1.0 6.2196e-02 3.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0 0  22   0   0   0 0      0       0   0 0.00e+00   0 0.00e+00   0
> > VecCUDACopyTo    100 1.0 1.0850e-02 2.3 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0 0   5   0   0   0 0      0       0 100 1.02e+02   0 0.00e+00   0
> > VecCopyFromSome  100 1.0 1.0263e-01 1.2 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0 0  54   0   0   0 0      0       0   0 0.00e+00 100 2.69e+02   0
> >
> > 6 MPI ranks + 6 GPUs + CUDA-aware SF
> > MatMult          100 1.0 1.1112e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00  1 99 97 18 0 100 100 100 100 0 509496 3133521   0 0.00e+00   0 0.00e+00 100
> > VecScatterBegin  100 1.0 7.9461e-02 1.1 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  1  0 97 18 0  70   0 100 100 0      0       0   0 0.00e+00   0 0.00e+00   0
> > VecScatterEnd    100 1.0 2.2805e-02 1.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0 0  17   0   0   0 0      0       0   0 0.00e+00   0 0.00e+00   0
> >
> > 24 MPI ranks + 6 GPUs + regular SF
> > MatMult          100 1.0 1.1094e-01 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  1 99 97 25 0 100 100 100 100 0 510337  951558 100 4.61e+01 100 6.72e+01 100
> > VecScatterBegin  100 1.0 4.8966e-02 1.8 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  0  0 97 25 0  34   0 100 100 0      0       0   0 0.00e+00 100 6.72e+01   0
> > VecScatterEnd    100 1.0 7.2969e-02 4.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0 0  42   0   0   0 0      0       0   0 0.00e+00   0 0.00e+00   0
> > VecCUDACopyTo    100 1.0 4.4487e-03 1.8 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0 0   3   0   0   0 0      0       0 100 4.61e+01   0 0.00e+00   0
> > VecCopyFromSome  100 1.0 4.3315e-02 1.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0 0  29   0   0   0 0      0       0   0 0.00e+00 100 6.72e+01   0
> >
> > 24 MPI ranks + 6 GPUs + CUDA-aware SF
> > MatMult          100 1.0 1.4597e-01 1.2 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  1 99 97 25 0 100 100 100 100 0 387864  973391   0 0.00e+00   0 0.00e+00 100
> > VecScatterBegin  100 1.0 6.4899e-02 2.9 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  1  0 97 25 0  35   0 100 100 0      0       0   0 0.00e+00   0 0.00e+00   0
> > VecScatterEnd    100 1.0 1.1179e-01 4.1 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0 0  48   0   0   0 0      0       0   0 0.00e+00   0 0.00e+00   0
> >
> > --Junchao Zhang
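
For concreteness, a minimal sketch of the change Barry is asking for: put a device-wide wait before the GPU-time log end that surrounds the asynchronous cusparse work, so the logged interval covers kernels that have actually finished. The PetscLogGpuTimeBegin/End names follow the timing calls mentioned in the thread, and a plain cudaDeviceSynchronize() stands in for whatever wait macro PETSc provides; this is an illustration, not the actual routine in aijcusparse.cu.

  #include <petscsys.h>
  #include <cuda_runtime.h>

  /* Illustration only, not the real code in aijcusparse.cu (requires a CUDA-enabled PETSc build). */
  static PetscErrorCode TimedAsyncGpuWork(void)
  {
    PetscErrorCode ierr;
    cudaError_t    cerr;

    PetscFunctionBegin;
    ierr = PetscLogGpuTimeBegin();CHKERRQ(ierr);
    /* ... asynchronous cusparse SpMV (and any CUDA kernels) are enqueued here ... */
    cerr = cudaDeviceSynchronize();CHKERRCUDA(cerr); /* the "GPU wait call": block until the device is done */
    ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);       /* the logged GPU time now reflects completed work */
    PetscFunctionReturn(0);
  }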
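
On the asynchronous-execution question: a host-side timer stopped immediately after an asynchronous cusparse call mostly measures launch overhead, so the time (and hence the reported GPU Mflop/s) is not meaningful unless something forces completion first. A generic CUDA sketch, independent of PETSc, of timing such work with CUDA events recorded on the same stream; the helper name and the surrounding setup are illustrative assumptions.

  #include <cuda_runtime.h>

  /* Generic illustration: measure work that was enqueued asynchronously on `stream`. */
  static float TimeAsyncWorkMs(cudaStream_t stream)
  {
    cudaEvent_t start, stop;
    float       ms = 0.0f;

    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, stream);
    /* ... enqueue the asynchronous cusparse SpMV (or any kernels) on `stream` here ... */
    cudaEventRecord(stop, stream);
    cudaEventSynchronize(stop);             /* wait until the enqueued work has actually finished */
    cudaEventElapsedTime(&ms, start, stop); /* elapsed device time between the two events, in ms */
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
  }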
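
Finally, a rough sketch of the kind of driver behind the MatMult test described in the Sep 20 message (not Junchao's actual code): load the matrix from a PETSc binary file and apply it 100 times, letting -mat_type aijcusparse select the GPU matrix type (MatCreateVecs then returns CUDA vectors for that type). The file name hv15r.petsc and the prior conversion of HV15R from Matrix Market to PETSc binary format are assumptions.

  #include <petscmat.h>

  /* Rough sketch of the MatMult benchmark described above. */
  int main(int argc, char **argv)
  {
    Mat            A;
    Vec            x, y;
    PetscViewer    viewer;
    PetscErrorCode ierr;
    PetscInt       i;

    ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
    ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD, "hv15r.petsc", FILE_MODE_READ, &viewer);CHKERRQ(ierr);
    ierr = MatCreate(PETSC_COMM_WORLD, &A);CHKERRQ(ierr);
    ierr = MatSetFromOptions(A);CHKERRQ(ierr);     /* honors -mat_type aijcusparse */
    ierr = MatLoad(A, viewer);CHKERRQ(ierr);
    ierr = PetscViewerDestroy(&viewer);CHKERRQ(ierr);
    ierr = MatCreateVecs(A, &x, &y);CHKERRQ(ierr); /* vectors of the matrix's native type */
    ierr = VecSet(x, 1.0);CHKERRQ(ierr);
    for (i = 0; i < 100; i++) {                    /* the 100 MatMults being timed */
      ierr = MatMult(A, x, y);CHKERRQ(ierr);
    }
    ierr = VecDestroy(&x);CHKERRQ(ierr);
    ierr = VecDestroy(&y);CHKERRQ(ierr);
    ierr = MatDestroy(&A);CHKERRQ(ierr);
    ierr = PetscFinalize();
    return ierr;
  }

A run such as mpiexec -n 6 ./matmult_bench -mat_type aijcusparse -vec_type cuda -log_view (the executable name is hypothetical; on Summit the launcher would be jsrun, and enabling MPS is a separate site-specific step) produces event lines like the MatMult/VecScatter rows in the table above.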