Yes, please, thank you.

On Tue, Sep 24, 2019 at 1:46 AM Mills, Richard Tran via petsc-dev <petsc-dev@mcs.anl.gov> wrote:
> Karl, that would be fantastic. Much obliged!
>
> --Richard
>
> On 9/23/19 8:09 PM, Karl Rupp wrote:
>
> Hi,
>
> `git grep cudaStreamCreate` reports that vectors, matrices and scatters
> create their own streams. This will almost inevitably create races (there
> is no synchronization mechanism implemented), unless one calls
> WaitForGPU() after each operation. Some of the non-deterministic tests
> can likely be explained by this.
>
> I'll clean this up in the next few hours if there are no objections.
>
> Best regards,
> Karli
>
> On 9/24/19 1:05 AM, Mills, Richard Tran via petsc-dev wrote:
>
> I'm no CUDA expert (not yet, anyway), but, from what I've read, the
> default stream (stream 0) is (mostly) synchronous to host and device, so
> WaitForGPU() is not needed in that case. I don't know if there is any
> performance penalty in explicitly calling it in that case, anyway.
>
> In any case, it looks like there are still some cases where potentially
> asynchronous CUDA library calls are being "timed" without a WaitForGPU()
> to ensure that the calls actually complete. I will make a pass through
> the aijcusparse and aijviennacl code looking for these.
>
> --Richard
>
> On 9/23/19 3:28 PM, Zhang, Junchao wrote:
>
> It looks like cusparsestruct->stream is always created (not NULL). I
> don't know the logic of the "if (!cusparsestruct->stream)".
>
> --Junchao Zhang
>
> On Mon, Sep 23, 2019 at 5:04 PM Mills, Richard Tran via petsc-dev
> <petsc-dev@mcs.anl.gov> wrote:
>
> In MatMultAdd_SeqAIJCUSPARSE, before Junchao's changes, towards the end
> of the function it had
>
>     if (!yy) { /* MatMult */
>       if (!cusparsestruct->stream) {
>         ierr = WaitForGPU();CHKERRCUDA(ierr);
>       }
>     }
>
> I assume we don't need the logic to do this only in the MatMult() case
> with no add, and should just do this all the time, for the purposes of
> timing if no other reason. Is there some reason NOT to do this, because
> of worries about the effects that these WaitForGPU() invocations might
> have on performance?
>
> I notice other problems in aijcusparse.cu, now that I look closer. In
> MatMultTransposeAdd_SeqAIJCUSPARSE(), I see that we have GPU timing
> calls around the cusparse_csr_spmv() (but no WaitForGPU() inside the
> timed region). I believe this is another area in which we get a
> meaningless timing. It looks like we need a WaitForGPU() there, and then
> maybe another one inside the timed region that handles the scatter. (I
> don't know if this stuff happens asynchronously or not.) But do we
> potentially want two WaitForGPU() calls in one function, just to help
> with getting timings? I don't have a good idea of how much overhead this
> adds.
>
> --Richard
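
As an aside, the timing point above can be seen with a small stand-alone
program. The sketch below is not PETSc code: cuBLAS daxpy stands in for the
cuSPARSE SpMV, gettimeofday() (via a made-up helper wtime()) for the PETSc
timer, and cudaDeviceSynchronize() for WaitForGPU(). Stopping the host timer
before synchronizing only measures the kernel-launch overhead, not the work.

    #include <cuda_runtime.h>
    #include <cublas_v2.h>
    #include <stdio.h>
    #include <sys/time.h>

    static double wtime(void)                        /* host wall-clock timer */
    {
      struct timeval tv;
      gettimeofday(&tv,NULL);
      return tv.tv_sec + 1e-6*tv.tv_usec;
    }

    int main(void)
    {
      const int      n     = 1<<24;                  /* 16M doubles per vector */
      const double   alpha = 2.0;
      double        *x,*y,t0,t_nosync,t_sync;
      cublasHandle_t handle;

      cudaMalloc((void**)&x,n*sizeof(double));       /* error checking omitted for brevity */
      cudaMalloc((void**)&y,n*sizeof(double));
      cudaMemset(x,0,n*sizeof(double));
      cudaMemset(y,0,n*sizeof(double));
      cublasCreate(&handle);
      cublasDaxpy(handle,n,&alpha,x,1,y,1);          /* warm-up */
      cudaDeviceSynchronize();

      t0 = wtime();
      cublasDaxpy(handle,n,&alpha,x,1,y,1);          /* asynchronous: returns before the kernel finishes */
      t_nosync = wtime() - t0;                       /* only measures the launch overhead */
      cudaDeviceSynchronize();                       /* the role WaitForGPU() plays inside PETSc */
      t_sync   = wtime() - t0;                       /* now the work has actually been done */

      printf("daxpy: no sync %.2e s, with sync %.2e s\n",t_nosync,t_sync);
      cublasDestroy(handle);
      cudaFree(x); cudaFree(y);
      return 0;
    }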

> On 9/21/19 12:03 PM, Zhang, Junchao via petsc-dev wrote:
>
> I made the following changes:
>
> 1) In MatMultAdd_SeqAIJCUSPARSE, use this code sequence at the end:
>
>     ierr = WaitForGPU();CHKERRCUDA(ierr);
>     ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);
>     ierr = PetscLogGpuFlops(2.0*a->nz);CHKERRQ(ierr);
>     PetscFunctionReturn(0);
>
> 2) In MatMult_MPIAIJCUSPARSE, use the following code sequence. The old code
> swapped the first two lines. Since with -log_view MatMultAdd_SeqAIJCUSPARSE
> is blocking, I changed the order to have better overlap.
>
>     ierr = VecScatterBegin(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
>     ierr = (*a->A->ops->mult)(a->A,xx,yy);CHKERRQ(ierr);
>     ierr = VecScatterEnd(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
>     ierr = (*a->B->ops->multadd)(a->B,a->lvec,yy,yy);CHKERRQ(ierr);
>
> 3) Log time directly in the test code so we can also know execution time
> without -log_view (and hence without the CUDA synchronization it implies).
> I manually calculated the Total Mflop/s for these cases for easy comparison.
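
A hypothetical stand-alone driver in the spirit of 3) could look like the
sketch below. This is not the actual test code used for the numbers that
follow; it builds a simple tridiagonal matrix, takes its size from a made-up
-n option, and uses MPI_Wtime() plus cudaDeviceSynchronize() (rather than the
internal WaitForGPU()) around the MatMult loop. It assumes a CUDA-enabled
PETSc build and would be run with something like
mpiexec -n 6 ./matmult_timing -mat_type aijcusparse -n 1000000.

    #include <petscmat.h>
    #include <cuda_runtime.h>

    int main(int argc,char **argv)
    {
      Mat            A;
      Vec            x,y;
      PetscInt       i,Istart,Iend,n=100000;
      int            it;
      double         t0,t1;
      cudaError_t    cerr;
      PetscErrorCode ierr;

      ierr = PetscInitialize(&argc,&argv,NULL,NULL);if (ierr) return ierr;
      ierr = PetscOptionsGetInt(NULL,NULL,"-n",&n,NULL);CHKERRQ(ierr);

      /* Simple tridiagonal matrix, just to have something to multiply;
         -mat_type aijcusparse selects the GPU implementation */
      ierr = MatCreate(PETSC_COMM_WORLD,&A);CHKERRQ(ierr);
      ierr = MatSetSizes(A,PETSC_DECIDE,PETSC_DECIDE,n,n);CHKERRQ(ierr);
      ierr = MatSetFromOptions(A);CHKERRQ(ierr);
      ierr = MatSetUp(A);CHKERRQ(ierr);
      ierr = MatGetOwnershipRange(A,&Istart,&Iend);CHKERRQ(ierr);
      for (i=Istart; i<Iend; i++) {
        if (i>0)   {ierr = MatSetValue(A,i,i-1,-1.0,INSERT_VALUES);CHKERRQ(ierr);}
        if (i<n-1) {ierr = MatSetValue(A,i,i+1,-1.0,INSERT_VALUES);CHKERRQ(ierr);}
        ierr = MatSetValue(A,i,i,2.0,INSERT_VALUES);CHKERRQ(ierr);
      }
      ierr = MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
      ierr = MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
      ierr = MatCreateVecs(A,&x,&y);CHKERRQ(ierr);
      ierr = VecSet(x,1.0);CHKERRQ(ierr);

      ierr = MatMult(A,x,y);CHKERRQ(ierr);             /* warm-up; triggers the initial CPU->GPU copies */
      cerr = cudaDeviceSynchronize();CHKERRCUDA(cerr);
      ierr = MPI_Barrier(PETSC_COMM_WORLD);CHKERRQ(ierr);
      t0   = MPI_Wtime();
      for (it=0; it<100; it++) {ierr = MatMult(A,x,y);CHKERRQ(ierr);}
      cerr = cudaDeviceSynchronize();CHKERRCUDA(cerr); /* do not stop the clock before the GPU is done */
      ierr = MPI_Barrier(PETSC_COMM_WORLD);CHKERRQ(ierr);
      t1   = MPI_Wtime();
      ierr = PetscPrintf(PETSC_COMM_WORLD,"100 MatMult: %g s\n",t1-t0);CHKERRQ(ierr);

      ierr = MatDestroy(&A);CHKERRQ(ierr);
      ierr = VecDestroy(&x);CHKERRQ(ierr);
      ierr = VecDestroy(&y);CHKERRQ(ierr);
      ierr = PetscFinalize();
      return ierr;
    }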

> <<Note the CPU versions are copied from yesterday's results>>
>
> ------------------------------------------------------------------------------------------------------------------------
> Event            Count     Time (sec)     Flop                           --- Global ---  --- Stage ----  Total   GPU     - CpuToGpu -    - GpuToCpu -  GPU
>                  Max Ratio Max        Ratio Max      Ratio Mess   AvgLen  Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s Mflop/s Count Size     Count Size     %F
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> 6 MPI ranks
> MatMult          100 1.0 1.1895e+01 1.0 9.63e+09 1.1 2.8e+03 2.2e+05 0.0e+00 24 99 97 18 0 100100100100 0 4743 0 0 0.00e+00 0 0.00e+00 0
> VecScatterBegin  100 1.0 4.9145e-02 3.0 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00 0 0 97 18 0 0 0100100 0 0 0 0 0.00e+00 0 0.00e+00 0
> VecScatterEnd    100 1.0 2.9441e+00 133 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 3 0 0 0 0 13 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
>
> 24 MPI ranks
> MatMult          100 1.0 3.1431e+00 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00 8 99 97 25 0 100100100100 0 17948 0 0 0.00e+00 0 0.00e+00 0
> VecScatterBegin  100 1.0 2.0583e-02 2.3 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00 0 0 97 25 0 0 0100100 0 0 0 0 0.00e+00 0 0.00e+00 0
> VecScatterEnd    100 1.0 1.0639e+00 50.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 2 0 0 0 0 19 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
>
> 42 MPI ranks
> MatMult          100 1.0 2.0519e+00 1.0 1.52e+09 1.3 3.5e+04 4.1e+04 0.0e+00 23 99 97 30 0 100100100100 0 27493 0 0 0.00e+00 0 0.00e+00 0
> VecScatterBegin  100 1.0 2.0971e-02 3.4 0.00e+00 0.0 3.5e+04 4.1e+04 0.0e+00 0 0 97 30 0 1 0100100 0 0 0 0 0.00e+00 0 0.00e+00 0
> VecScatterEnd    100 1.0 8.5184e-01 62.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 6 0 0 0 0 24 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
>
> 6 MPI ranks + 6 GPUs + regular SF + log_view
> MatMult          100 1.0 1.6863e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00 0 99 97 18 0 100100100100 0 335743 629278 100 1.02e+02 100 2.69e+02 100
> VecScatterBegin  100 1.0 5.0157e-02 1.6 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00 0 0 97 18 0 24 0100100 0 0 0 0 0.00e+00 100 2.69e+02 0
> VecScatterEnd    100 1.0 4.9155e-02 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 20 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> VecCUDACopyTo    100 1.0 9.5078e-03 2.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 4 0 0 0 0 0 0 100 1.02e+02 0 0.00e+00 0
> VecCopyFromSome  100 1.0 2.8485e-02 1.4 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 14 0 0 0 0 0 0 0 0.00e+00 100 2.69e+02 0
>
> 6 MPI ranks + 6 GPUs + regular SF + No log_view
> MatMult:         100 1.0 1.4180e-01    399268
>
> 6 MPI ranks + 6 GPUs + CUDA-aware SF + log_view
> MatMult          100 1.0 1.1053e-01 1.0 9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00 1 99 97 18 0 100100100100 0 512224 642075 0 0.00e+00 0 0.00e+00 100
> VecScatterBegin  100 1.0 8.3418e-03 1.5 0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00 0 0 97 18 0 6 0100100 0 0 0 0 0.00e+00 0 0.00e+00 0
> VecScatterEnd    100 1.0 2.2619e-02 1.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 16 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
>
> 6 MPI ranks + 6 GPUs + CUDA-aware SF + No log_view
> MatMult:         100 1.0 9.8344e-02    575717
>
> 24 MPI ranks + 6 GPUs + regular SF + log_view
> MatMult          100 1.0 1.1572e-01 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00 0 99 97 25 0 100100100100 0 489223 708601 100 4.61e+01 100 6.72e+01 100
> VecScatterBegin  100 1.0 2.0641e-02 2.0 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00 0 0 97 25 0 13 0100100 0 0 0 0 0.00e+00 100 6.72e+01 0
> VecScatterEnd    100 1.0 6.8114e-02 5.6 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 38 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
> VecCUDACopyTo    100 1.0 6.6646e-03 2.5 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 3 0 0 0 0 0 0 100 4.61e+01 0 0.00e+00 0
> VecCopyFromSome  100 1.0 1.0546e-02 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 0 0 0 0 0 7 0 0 0 0 0 0 0 0.00e+00 100 6.72e+01 0
>
> 24 MPI ranks + 6 GPUs + regular SF + No log_view
> MatMult:         100 1.0 9.8254e-02    576201
>
> 24 MPI ranks + 6 GPUs + CUDA-aware SF + log_view
> MatMult          100 1.0 1.1602e-01 1.0 2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00 1 99 97 25 0 100100100100 0 487956 707524 0 0.00e+00 0 0.00e+00 100
> VecScatterBegin  100 1.0 2.7088e-02 7.0 0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00 0 0 97 25 0 8 0100100 0 0 0 0 0.00e+00 0 0.00e+00 0
> VecScatterEnd    100 1.0 8.4262e-02 3.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 1 0 0 0 0 52 0 0 0 0 0 0 0 0.00e+00 0 0.00e+00 0
>
> 24 MPI ranks + 6 GPUs + CUDA-aware SF + No log_view
> MatMult:         100 1.0 1.0397e-01    544510
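
Coming back to Karl's observation that vectors, matrices and scatters create
their own streams: the hazard can be sketched with plain CUDA/cuBLAS (again
not PETSc code; the two handles, the streams s_vec/s_mat and the event x_ready
are invented for the illustration). Work submitted to different streams is not
ordered with respect to other streams, so without the recorded event the daxpy
below could start before the dscal that produces its input has finished; a
device-wide synchronization such as WaitForGPU() after each operation would
impose the same ordering, at the cost of losing overlap.

    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    int main(void)
    {
      const int      n = 1<<24;
      const double   two = 2.0, one = 1.0;
      double        *x,*y;
      cudaStream_t   s_vec,s_mat;                /* "per-object" streams */
      cublasHandle_t h_vec,h_mat;
      cudaEvent_t    x_ready;

      cudaMalloc((void**)&x,n*sizeof(double));   /* error checking omitted for brevity */
      cudaMalloc((void**)&y,n*sizeof(double));
      cudaMemset(x,0,n*sizeof(double));
      cudaMemset(y,0,n*sizeof(double));
      cudaStreamCreate(&s_vec);
      cudaStreamCreate(&s_mat);
      cublasCreate(&h_vec);  cublasSetStream(h_vec,s_vec);
      cublasCreate(&h_mat);  cublasSetStream(h_mat,s_mat);
      cudaEventCreate(&x_ready);

      cublasDscal(h_vec,n,&two,x,1);             /* "vector" work on s_vec */
      cudaEventRecord(x_ready,s_vec);            /* without these two lines, the daxpy on s_mat */
      cudaStreamWaitEvent(s_mat,x_ready,0);      /* could run concurrently with the dscal       */
      cublasDaxpy(h_mat,n,&one,x,1,y,1);         /* "matrix" work on s_mat consuming x */

      cudaDeviceSynchronize();
      cudaEventDestroy(x_ready);
      cublasDestroy(h_vec); cublasDestroy(h_mat);
      cudaStreamDestroy(s_vec); cudaStreamDestroy(s_mat);
      cudaFree(x); cudaFree(y);
      return 0;
    }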