I made the following changes:
1) In MatMultAdd_SeqAIJCUSPARSE, use this code sequence at the end:
ierr = WaitForGPU();CHKERRCUDA(ierr);
ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);
ierr = PetscLogGpuFlops(2.0*a->nz);CHKERRQ(ierr);
PetscFunctionReturn(0);
2) In MatMult_MPIAIJCUSPARSE, use the following code sequence. The old code had the first two lines in the opposite order. Since MatMultAdd_SeqAIJCUSPARSE is blocking when -log_view is used, I changed the order to get better overlap of communication and computation:
ierr = VecScatterBegin(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
ierr = (*a->A->ops->mult)(a->A,xx,yy);CHKERRQ(ierr);
ierr = VecScatterEnd(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
ierr = (*a->B->ops->multadd)(a->B,a->lvec,yy,yy);CHKERRQ(ierr);
3) Log time directly in the test code so that we can also measure execution time without -log_view (this requires explicit CUDA synchronization). I manually calculated the Total Mflop/s for these cases for easy comparison.
<<Note the CPU versions are copied from yesterday's results>>
------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
---------------------------------------------------------------------------------------------------------------------------------------------------------------
6 MPI ranks
MatMult              100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05 0.0e+00 24 99 97 18  0 100100100100  0   4743       0      0 0.00e+00    0 0.00e+00  0
VecScatterBegin      100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0   0  0100100  0      0       0      0 0.00e+00    0 0.00e+00  0
VecScatterEnd        100 1.0 2.9441e+00 133 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0  13  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
24 MPI ranks
MatMult              100 1.0 3.1431e+00 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  8 99 97 25  0 100100100100  0  17948       0      0 0.00e+00    0 0.00e+00  0
VecScatterBegin      100 1.0 2.0583e-02 2.3 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  0  0 97 25  0   0  0100100  0      0       0      0 0.00e+00    0 0.00e+00  0
VecScatterEnd        100 1.0 1.0639e+00 50.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2  0  0  0  0  19  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
42 MPI ranks
MatMult              100 1.0 2.0519e+00 1.0 1.52e+09 1.3 3.5e+04 4.1e+04 0.0e+00 23 99 97 30  0 100100100100  0  27493       0      0 0.00e+00    0 0.00e+00  0
VecScatterBegin      100 1.0 2.0971e-02 3.4 0.00e+00 0.0 3.5e+04 4.1e+04 0.0e+00  0  0 97 30  0   1  0100100  0      0       0      0 0.00e+00    0 0.00e+00  0
VecScatterEnd        100 1.0 8.5184e-01 62.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 6  0  0  0  0  24  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
6 MPI ranks + 6 GPUs + regular SF + log_view
MatMult              100 1.0 1.6863e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00  0 99 97 18  0 100100100100  0 335743  629278    100 1.02e+02  100 2.69e+02 100
VecScatterBegin      100 1.0 5.0157e-02 1.6 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0  24  0100100  0      0       0      0 0.00e+00  100 2.69e+02  0
VecScatterEnd        100 1.0 4.9155e-02 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  20  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
VecCUDACopyTo        100 1.0 9.5078e-03 2.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   4  0  0  0  0      0       0    100 1.02e+02    0 0.00e+00  0
VecCopyFromSome      100 1.0 2.8485e-02 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  14  0  0  0  0      0       0      0 0.00e+00  100 2.69e+02  0
6 MPI ranks + 6 GPUs + regular SF + No log_view
MatMult:             100 1.0 1.4180e-01                                                                         399268
6 MPI ranks + 6 GPUs + CUDA-aware SF + log_view
MatMult              100 1.0 1.1053e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00  1 99 97 18  0 100100100100  0 512224  642075      0 0.00e+00    0 0.00e+00 100
VecScatterBegin      100 1.0 8.3418e-03 1.5 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0   6  0100100  0      0       0      0 0.00e+00    0 0.00e+00  0
VecScatterEnd        100 1.0 2.2619e-02 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  16  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
6 MPI ranks + 6 GPUs + CUDA-aware SF + No log_view
MatMult:             100 1.0 9.8344e-02                                                                         575717
24 MPI ranks + 6 GPUs + regular SF + log_view
MatMult              100 1.0 1.1572e-01 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  0 99 97 25  0 100100100100  0 489223  708601    100 4.61e+01  100 6.72e+01 100
VecScatterBegin      100 1.0 2.0641e-02 2.0 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  0  0 97 25  0  13  0100100  0      0       0      0 0.00e+00  100 6.72e+01  0
VecScatterEnd        100 1.0 6.8114e-02 5.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  38  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
VecCUDACopyTo        100 1.0 6.6646e-03 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   3  0  0  0  0      0       0    100 4.61e+01    0 0.00e+00  0
VecCopyFromSome      100 1.0 1.0546e-02 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   7  0  0  0  0      0       0      0 0.00e+00  100 6.72e+01  0
24 MPI ranks + 6 GPUs + regular SF + No log_view
MatMult:             100 1.0 9.8254e-02                                                                         576201
24 MPI ranks + 6 GPUs + CUDA-aware SF + log_view
MatMult              100 1.0 1.1602e-01 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  1 99 97 25  0 100100100100  0 487956  707524      0 0.00e+00    0 0.00e+00 100
VecScatterBegin      100 1.0 2.7088e-02 7.0 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  0  0 97 25  0   8  0100100  0      0       0      0 0.00e+00    0 0.00e+00  0
VecScatterEnd        100 1.0 8.4262e-02 3.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  52  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
24 MPI ranks + 6 GPUs + CUDA-aware SF + No log_view
MatMult:             100 1.0 1.0397e-01                                                                         544510