I'm no CUDA expert (not yet, anyway), but from what I've read, the default 
stream (stream 0) is (mostly) synchronous with respect to the host and device, so 
WaitForGPU() should not be needed in that case. I also don't know whether there is 
any performance penalty in calling it explicitly anyway.

In any case, it looks like there are still some cases where potentially 
asynchronous CUDA library calls are being "timed" without a WaitForGPU() to 
ensure that the calls actually complete. I will make a pass through the 
aijcusparse and aijviennacl code looking for these.
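
To illustrate what I mean about meaningless timings, here is a standalone CUDA sketch 
(not PETSc code; the kernel and problem size are made up, and I am treating WaitForGPU() 
as essentially a cudaDeviceSynchronize()):

/* A host-side timer around an asynchronous kernel launch measures only the
   launch overhead unless we synchronize before stopping the clock. */
#include <cuda_runtime.h>
#include <stdio.h>
#include <time.h>

__global__ void busywork(double *x,int n)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) for (int k=0; k<2000; k++) x[i] = x[i]*1.0000001 + 1.0;
}

static double wtime(void)
{
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC,&ts);
  return ts.tv_sec + 1e-9*ts.tv_nsec;
}

int main(void)
{
  int     n = 1<<22;
  double *d,t0,t1,t2,t3;

  cudaMalloc((void**)&d,n*sizeof(double));
  cudaMemset(d,0,n*sizeof(double));

  t0 = wtime(); busywork<<<(n+255)/256,256>>>(d,n);                          t1 = wtime(); /* no sync: timer sees only the launch */
  cudaDeviceSynchronize();
  t2 = wtime(); busywork<<<(n+255)/256,256>>>(d,n); cudaDeviceSynchronize(); t3 = wtime(); /* sync: timer sees the whole kernel   */

  printf("without sync: %.3e s   with sync: %.3e s\n",t1-t0,t3-t2);
  cudaFree(d);
  return 0;
}

The "without sync" number is just the kernel launch overhead, which is why a timed 
region without a WaitForGPU() can look suspiciously fast.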

--Richard

On 9/23/19 3:28 PM, Zhang, Junchao wrote:
It looks like cusparsestruct->stream is always created (i.e., never NULL), so I don't 
understand the logic of the "if (!cusparsestruct->stream)" check.
--Junchao Zhang


On Mon, Sep 23, 2019 at 5:04 PM Mills, Richard Tran via petsc-dev 
<petsc-dev@mcs.anl.gov> wrote:
In MatMultAdd_SeqAIJCUSPARSE, before Junchao's changes, towards the end of the 
function it had

  if (!yy) { /* MatMult */
    if (!cusparsestruct->stream) {
      ierr = WaitForGPU();CHKERRCUDA(ierr);
    }
  }

I assume we don't need the logic that does this only in the MatMult() (no-add) case, 
and we should just do it all the time, for timing purposes if no other reason. Is 
there some reason NOT to do this, out of worry about the effects that these 
WaitForGPU() invocations might have on performance?
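
In other words, at the end of the function, just do something like this unconditionally 
(a sketch of what I am suggesting, not tested):

  /* always sync before returning, whether or not this is the no-add case and
     whatever cusparsestruct->stream is, so that the timings mean something */
  ierr = WaitForGPU();CHKERRCUDA(ierr);
  PetscFunctionReturn(0);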

I notice other problems in aijcusparse.cu, now that I look closer. In 
MatMultTransposeAdd_SeqAIJCUSPARSE(), I see that we have GPU timing calls around 
the cusparse_csr_spmv(), but no WaitForGPU() inside the timed region. I believe this 
is another place where we get a meaningless timing. It looks like we need a 
WaitForGPU() there, and then maybe another one inside the timed region that handles 
the scatter. (I don't know whether that part happens asynchronously or not.) But do 
we really want two WaitForGPU() calls in one function, just to help with getting 
timings? I don't have a good idea of how much overhead this adds.
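
Roughly the placement I have in mind is the following (a sketch only: the actual 
cusparse_csr_spmv() arguments and the scatter kernel are elided, and whether the 
second sync is worth its overhead is exactly my question):

  ierr = PetscLogGpuTimeBegin();CHKERRQ(ierr);
  stat = cusparse_csr_spmv(...);CHKERRCUSPARSE(stat);  /* asynchronous transpose SpMV               */
  ierr = WaitForGPU();CHKERRCUDA(ierr);                /* sync #1: makes the SpMV timing meaningful */
  ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);

  ierr = PetscLogGpuTimeBegin();CHKERRQ(ierr);
  /* ... kernel that handles the scatter of the result ... */
  ierr = WaitForGPU();CHKERRCUDA(ierr);                /* sync #2: only needed if we time this part */
  ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);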

--Richard

On 9/21/19 12:03 PM, Zhang, Junchao via petsc-dev wrote:
I made the following changes:
1) In MatMultAdd_SeqAIJCUSPARSE, use this code sequence at the end
  ierr = WaitForGPU();CHKERRCUDA(ierr);
  ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);
  ierr = PetscLogGpuFlops(2.0*a->nz);CHKERRQ(ierr);
  PetscFunctionReturn(0);
2) In MatMult_MPIAIJCUSPARSE, use the following code sequence. The old code had the 
first two lines swapped. Since MatMultAdd_SeqAIJCUSPARSE is blocking when -log_view 
is used, I changed the order to get better overlap of communication and computation.
  ierr = VecScatterBegin(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = (*a->A->ops->mult)(a->A,xx,yy);CHKERRQ(ierr);
  ierr = VecScatterEnd(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = (*a->B->ops->multadd)(a->B,a->lvec,yy,yy);CHKERRQ(ierr);
3) Log time directly in the test code so we also know the execution time without 
-log_view (and hence without the CUDA synchronization it forces); roughly as in the 
sketch below. I manually calculated the Total Mflop/s for these cases for easy 
comparison.
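
Roughly, the direct timing looks like the following (a sketch only, not the exact test 
code: A, x, y, and nreps are placeholders, and cudaDeviceSynchronize() stands in for 
the internal WaitForGPU()):

  PetscLogDouble t0,t1;
  cudaError_t    cerr;
  PetscInt       i;

  cerr = cudaDeviceSynchronize();CHKERRCUDA(cerr);   /* drain pending GPU work before starting the clock */
  ierr = MPI_Barrier(PETSC_COMM_WORLD);CHKERRQ(ierr);
  ierr = PetscTime(&t0);CHKERRQ(ierr);
  for (i=0; i<nreps; i++) {ierr = MatMult(A,x,y);CHKERRQ(ierr);}
  cerr = cudaDeviceSynchronize();CHKERRCUDA(cerr);   /* make sure the last MatMult actually finished     */
  ierr = MPI_Barrier(PETSC_COMM_WORLD);CHKERRQ(ierr);
  ierr = PetscTime(&t1);CHKERRQ(ierr);
  ierr = PetscPrintf(PETSC_COMM_WORLD,"MatMult: %g s\n",(double)(t1-t0));CHKERRQ(ierr);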

<<Note the CPU versions are copied from yesterday's results>>

------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
---------------------------------------------------------------------------------------------------------------------------------------------------------------
6 MPI ranks
MatMult              100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05 0.0e+00 24 99 97 18  0 100100100100  0  4743       0      0 0.00e+00    0 0.00e+00  0
VecScatterBegin      100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0   0  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0
VecScatterEnd        100 1.0 2.9441e+00 133 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0  13  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0

24 MPI ranks
MatMult              100 1.0 3.1431e+00 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  8 99 97 25  0 100100100100  0 17948       0      0 0.00e+00    0 0.00e+00  0
VecScatterBegin      100 1.0 2.0583e-02 2.3 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  0  0 97 25  0   0  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0
VecScatterEnd        100 1.0 1.0639e+00 50.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0  19  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0

42 MPI ranks
MatMult              100 1.0 2.0519e+00 1.0 1.52e+09 1.3 3.5e+04 4.1e+04 0.0e+00 23 99 97 30  0 100100100100  0 27493       0      0 0.00e+00    0 0.00e+00  0
VecScatterBegin      100 1.0 2.0971e-02 3.4 0.00e+00 0.0 3.5e+04 4.1e+04 0.0e+00  0  0 97 30  0   1  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0
VecScatterEnd        100 1.0 8.5184e-01 62.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  6  0  0  0  0  24  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0

6 MPI ranks + 6 GPUs + regular SF + log_view
MatMult              100 1.0 1.6863e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00  0 99 97 18  0 100100100100  0 335743   629278  100 1.02e+02  100 2.69e+02 100
VecScatterBegin      100 1.0 5.0157e-02 1.6 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0  24  0100100  0     0       0      0 0.00e+00  100 2.69e+02  0
VecScatterEnd        100 1.0 4.9155e-02 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  20  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
VecCUDACopyTo        100 1.0 9.5078e-03 2.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   4  0  0  0  0     0       0    100 1.02e+02    0 0.00e+00  0
VecCopyFromSome      100 1.0 2.8485e-02 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  14  0  0  0  0     0       0      0 0.00e+00  100 2.69e+02  0

6 MPI ranks + 6 GPUs + regular SF + No log_view
MatMult:             100 1.0 1.4180e-01                               399268

6 MPI ranks + 6 GPUs + CUDA-aware SF + log_view
MatMult              100 1.0 1.1053e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00  1 99 97 18  0 100100100100  0 512224   642075    0 0.00e+00    0 0.00e+00 100
VecScatterBegin      100 1.0 8.3418e-03 1.5 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0   6  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0
VecScatterEnd        100 1.0 2.2619e-02 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  16  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0

6 MPI ranks + 6 GPUs + CUDA-aware SF + No log_view
MatMult:             100 1.0 9.8344e-02                               575717

24 MPI ranks + 6 GPUs + regular SF + log_view
MatMult              100 1.0 1.1572e-01 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  0 99 97 25  0 100100100100  0 489223   708601  100 4.61e+01  100 6.72e+01 100
VecScatterBegin      100 1.0 2.0641e-02 2.0 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  0  0 97 25  0  13  0100100  0     0       0      0 0.00e+00  100 6.72e+01  0
VecScatterEnd        100 1.0 6.8114e-02 5.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  38  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0
VecCUDACopyTo        100 1.0 6.6646e-03 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   3  0  0  0  0     0       0    100 4.61e+01    0 0.00e+00  0
VecCopyFromSome      100 1.0 1.0546e-02 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   7  0  0  0  0     0       0      0 0.00e+00  100 6.72e+01  0

24 MPI ranks + 6 GPUs + regular SF + No log_view
MatMult:             100 1.0 9.8254e-02                               576201

24 MPI ranks + 6 GPUs + CUDA-aware SF + log_view
MatMult              100 1.0 1.1602e-01 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  1 99 97 25  0 100100100100  0 487956   707524    0 0.00e+00    0 0.00e+00 100
VecScatterBegin      100 1.0 2.7088e-02 7.0 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  0  0 97 25  0   8  0100100  0     0       0      0 0.00e+00    0 0.00e+00  0
VecScatterEnd        100 1.0 8.4262e-02 3.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  52  0  0  0  0     0       0      0 0.00e+00    0 0.00e+00  0

24 MPI ranks + 6 GPUs + CUDA-aware SF + No log_view
MatMult:             100 1.0 1.0397e-01                               544510
