Hi Mark, Richard, Junchao, et al.,

here we go:
https://gitlab.com/petsc/petsc/merge_requests/2091

This indeed fixes all the inconsistencies in the test results for SNES ex19 and even ex56. A priori I wasn't sure about the latter, but it looks like this was the only missing piece.

Mark, this should allow you to move forward with GPUs.

Best regards,
Karli



On 9/24/19 11:05 AM, Mark Adams wrote:
Yes, please, thank you.

On Tue, Sep 24, 2019 at 1:46 AM Mills, Richard Tran via petsc-dev <petsc-dev@mcs.anl.gov <mailto:petsc-dev@mcs.anl.gov>> wrote:

    Karl, that would be fantastic. Much obliged!

    --Richard

    On 9/23/19 8:09 PM, Karl Rupp wrote:
    Hi,

    `git grep cudaStreamCreate` reports that vectors, matrices and
    scatters create their own streams. This will almost inevitably
    create races (there is no synchronization mechanism implemented),
    unless one calls WaitForGPU() after each operation. Some of the
    non-deterministic tests can likely be explained by this.
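
    As a minimal illustration of the hazard (plain CUDA, not actual
    PETSc code; the kernels and names below are made up):

      #include <cuda_runtime.h>

      __global__ void writeKernel(double *x, int n)
      { int i = blockIdx.x*blockDim.x + threadIdx.x; if (i < n) x[i] *= 2.0; }

      __global__ void readKernel(const double *x, double *y, int n)
      { int i = blockIdx.x*blockDim.x + threadIdx.x; if (i < n) y[i] = x[i]; }

      void race_sketch(double *d_x, double *d_y, int n)
      {
        cudaStream_t s_vec, s_scatter;      /* each object creating its own stream */
        cudaStreamCreate(&s_vec);
        cudaStreamCreate(&s_scatter);
        writeKernel<<<(n+255)/256, 256, 0, s_vec>>>(d_x, n);         /* "vector" kernel writes d_x */
        readKernel<<<(n+255)/256, 256, 0, s_scatter>>>(d_x, d_y, n); /* "scatter" kernel reads d_x */
        /* no event or stream synchronization between the two streams: the read
           may overlap the write, giving nondeterministic results unless the
           caller inserts a WaitForGPU()/cudaDeviceSynchronize() in between */
        cudaStreamDestroy(s_vec);
        cudaStreamDestroy(s_scatter);
      }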

    I'll clean this up in the next few hours if there are no objections.

    Best regards,
    Karli



    On 9/24/19 1:05 AM, Mills, Richard Tran via petsc-dev wrote:
    I'm no CUDA expert (not yet, anyway), but from what I've read,
    the default stream (stream 0) is (mostly) synchronous to host and
    device, so WaitForGPU() is not needed in that case. I don't know
    whether there is any performance penalty in calling it explicitly
    anyway.

    In any case, it looks like there are still some cases where
    potentially asynchronous CUDA library calls are being "timed"
    without a WaitForGPU() to ensure that the calls actually
    complete. I will make a pass through the aijcusparse and
    aijviennacl code looking for these.
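
    (The pattern to look for is roughly

      ierr = PetscLogGpuTimeBegin();CHKERRQ(ierr);
      /* asynchronous cuSPARSE call or CUDA kernel launch here */
      ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);   /* stops the clock before the GPU work is done */

    i.e. a timed region with no WaitForGPU() before the End call.
    PetscLogGpuTimeEnd() appears in the snippets later in this thread;
    Begin is assumed to be its counterpart.)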

    --Richard

    On 9/23/19 3:28 PM, Zhang, Junchao wrote:
    It looks like cusparsestruct->stream is always created (never NULL), so
    I don't understand the logic of the "if (!cusparsestruct->stream)" check.
    --Junchao Zhang


    On Mon, Sep 23, 2019 at 5:04 PM Mills, Richard Tran via
    petsc-dev <petsc-dev@mcs.anl.gov <mailto:petsc-dev@mcs.anl.gov>>
    wrote:

        In MatMultAdd_SeqAIJCUSPARSE, before Junchao's changes, towards
        the end of the function it had

          if (!yy) { /* MatMult */
            if (!cusparsestruct->stream) {
              ierr = WaitForGPU();CHKERRCUDA(ierr);
            }
          }

        I assume we don't need the logic that does this only in the
        MatMult() (no add) case, and should just do this all the time,
        for timing purposes if no other reason. Is there some reason NOT
        to do this, out of worry about the effects that these
        WaitForGPU() invocations might have on performance?

        I notice other problems in aijcusparse.cu, now that I look
        closer. In MatMultTransposeAdd_SeqAIJCUSPARSE(), I see that we
        have GPU timing calls around the cusparse_csr_spmv() (but no
        WaitForGPU() inside the timed region). I believe this is another
        place where we get a meaningless timing. It looks like we need a
        WaitForGPU() there, and then maybe another one inside the timed
        region that handles the scatter. (I don't know whether this stuff
        happens asynchronously or not.) But do we potentially want two
        WaitForGPU() calls in one function, just to help with getting
        timings? I don't have a good idea of how much overhead this adds.
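
        For concreteness, a rough sketch of what that timed region might
        look like (hypothetical: the cusparse_csr_spmv() arguments are
        elided, and PetscLogGpuTimeBegin() is assumed as the counterpart
        of the PetscLogGpuTimeEnd() used elsewhere in this thread):

          ierr = PetscLogGpuTimeBegin();CHKERRQ(ierr);
          stat = cusparse_csr_spmv(/* ...arguments elided... */);  /* asynchronous launch */
          ierr = WaitForGPU();CHKERRCUDA(ierr);                    /* first wait: the SpMV has finished */
          ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);               /* the timed interval now covers the SpMV */

          /* ...transpose scatter/add handling... */
          ierr = WaitForGPU();CHKERRCUDA(ierr);                    /* second wait, if that part is timed too */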

        --Richard

        On 9/21/19 12:03 PM, Zhang, Junchao via petsc-dev wrote:
        I made the following changes:
        1) In MatMultAdd_SeqAIJCUSPARSE, use this code sequence at the end:
          ierr = WaitForGPU();CHKERRCUDA(ierr);
          ierr = PetscLogGpuTimeEnd();CHKERRQ(ierr);
          ierr = PetscLogGpuFlops(2.0*a->nz);CHKERRQ(ierr);
          PetscFunctionReturn(0);
        2) In MatMult_MPIAIJCUSPARSE, use the following code sequence.
        The old code swapped the first two lines. Since
        MatMultAdd_SeqAIJCUSPARSE is blocking when -log_view is used,
        I changed the order to get better overlap.
          ierr = VecScatterBegin(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
          ierr = (*a->A->ops->mult)(a->A,xx,yy);CHKERRQ(ierr);
          ierr = VecScatterEnd(a->Mvctx,xx,a->lvec,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
          ierr = (*a->B->ops->multadd)(a->B,a->lvec,yy,yy);CHKERRQ(ierr);
        3) Log time directly in the test code so we can also know the
        execution time without -log_view (and hence without the CUDA
        synchronization that logging triggers). I manually calculated
        the Total Mflop/s for these cases for easy comparison; a sketch
        of this direct timing follows below.
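
        Roughly, the direct timing looks like this (the loop count and
        variable names are illustrative, and WaitForGPU() here stands
        for a device synchronization such as cudaDeviceSynchronize()):

          PetscInt       i;
          PetscLogDouble t0,t1;

          ierr = WaitForGPU();CHKERRCUDA(ierr);               /* drain any pending GPU work */
          ierr = MPI_Barrier(PETSC_COMM_WORLD);CHKERRQ(ierr);
          t0   = MPI_Wtime();
          for (i=0; i<100; i++) {
            ierr = MatMult(A,x,y);CHKERRQ(ierr);
          }
          ierr = WaitForGPU();CHKERRCUDA(ierr);               /* make sure the last MatMult has finished */
          t1   = MPI_Wtime();
          /* Mflop/s = (total flops across ranks) / (t1-t0) / 1e6 */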

        <<Note the CPU versions are copied from yesterday's results>>

---------------------------------------------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ----  Total   GPU    - CpuToGpu -   - GpuToCpu - GPU
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   AvgLen  Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s Mflop/s Count   Size   Count   Size  %F
---------------------------------------------------------------------------------------------------------------------------------------------------------------

6 MPI ranks
MatMult              100 1.0 1.1895e+01 1.0  9.63e+09 1.1 2.8e+03 2.2e+05 0.0e+00 24 99 97 18  0 100100100100  0   4743       0      0 0.00e+00    0 0.00e+00  0
VecScatterBegin      100 1.0 4.9145e-02 3.0  0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0   0  0100100  0      0       0      0 0.00e+00    0 0.00e+00  0
VecScatterEnd        100 1.0 2.9441e+00 133  0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0  13  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0

24 MPI ranks
MatMult              100 1.0 3.1431e+00 1.0  2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  8 99 97 25  0 100100100100  0  17948       0      0 0.00e+00    0 0.00e+00  0
VecScatterBegin      100 1.0 2.0583e-02 2.3  0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  0  0 97 25  0   0  0100100  0      0       0      0 0.00e+00    0 0.00e+00  0
VecScatterEnd        100 1.0 1.0639e+00 50.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  2  0  0  0  0  19  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0

42 MPI ranks
MatMult              100 1.0 2.0519e+00 1.0  1.52e+09 1.3 3.5e+04 4.1e+04 0.0e+00 23 99 97 30  0 100100100100  0  27493       0      0 0.00e+00    0 0.00e+00  0
VecScatterBegin      100 1.0 2.0971e-02 3.4  0.00e+00 0.0 3.5e+04 4.1e+04 0.0e+00  0  0 97 30  0   1  0100100  0      0       0      0 0.00e+00    0 0.00e+00  0
VecScatterEnd        100 1.0 8.5184e-01 62.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  6  0  0  0  0  24  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0

6 MPI ranks + 6 GPUs + regular SF + log_view
MatMult              100 1.0 1.6863e-01 1.0  9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00  0 99 97 18  0 100100100100  0 335743  629278    100 1.02e+02  100 2.69e+02 100
VecScatterBegin      100 1.0 5.0157e-02 1.6  0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0  24  0100100  0      0       0      0 0.00e+00  100 2.69e+02  0
VecScatterEnd        100 1.0 4.9155e-02 2.5  0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  20  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
VecCUDACopyTo        100 1.0 9.5078e-03 2.0  0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   4  0  0  0  0      0       0    100 1.02e+02    0 0.00e+00  0
VecCopyFromSome      100 1.0 2.8485e-02 1.4  0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  14  0  0  0  0      0       0      0 0.00e+00  100 2.69e+02  0

6 MPI ranks + 6 GPUs + regular SF + No log_view
MatMult:             100 1.0 1.4180e-01    (manually computed Total Mflop/s: 399268)

6 MPI ranks + 6 GPUs + CUDA-aware SF + log_view
MatMult              100 1.0 1.1053e-01 1.0  9.66e+09 1.1 2.8e+03 2.2e+05 0.0e+00  1 99 97 18  0 100100100100  0 512224  642075      0 0.00e+00    0 0.00e+00 100
VecScatterBegin      100 1.0 8.3418e-03 1.5  0.00e+00 0.0 2.8e+03 2.2e+05 0.0e+00  0  0 97 18  0   6  0100100  0      0       0      0 0.00e+00    0 0.00e+00  0
VecScatterEnd        100 1.0 2.2619e-02 1.6  0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  16  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0

6 MPI ranks + 6 GPUs + CUDA-aware SF + No log_view
MatMult:             100 1.0 9.8344e-02    (manually computed Total Mflop/s: 575717)

24 MPI ranks + 6 GPUs + regular SF + log_view
MatMult              100 1.0 1.1572e-01 1.0  2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  0 99 97 25  0 100100100100  0 489223  708601    100 4.61e+01  100 6.72e+01 100
VecScatterBegin      100 1.0 2.0641e-02 2.0  0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  0  0 97 25  0  13  0100100  0      0       0      0 0.00e+00  100 6.72e+01  0
VecScatterEnd        100 1.0 6.8114e-02 5.6  0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0  38  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0
VecCUDACopyTo        100 1.0 6.6646e-03 2.5  0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   3  0  0  0  0      0       0    100 4.61e+01    0 0.00e+00  0
VecCopyFromSome      100 1.0 1.0546e-02 1.7  0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   7  0  0  0  0      0       0      0 0.00e+00  100 6.72e+01  0

24 MPI ranks + 6 GPUs + regular SF + No log_view
MatMult:             100 1.0 9.8254e-02    (manually computed Total Mflop/s: 576201)

24 MPI ranks + 6 GPUs + CUDA-aware SF + log_view
MatMult              100 1.0 1.1602e-01 1.0  2.63e+09 1.2 1.9e+04 5.9e+04 0.0e+00  1 99 97 25  0 100100100100  0 487956  707524      0 0.00e+00    0 0.00e+00 100
VecScatterBegin      100 1.0 2.7088e-02 7.0  0.00e+00 0.0 1.9e+04 5.9e+04 0.0e+00  0  0 97 25  0   8  0100100  0      0       0      0 0.00e+00    0 0.00e+00  0
VecScatterEnd        100 1.0 8.4262e-02 3.0  0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  1  0  0  0  0  52  0  0  0  0      0       0      0 0.00e+00    0 0.00e+00  0

24 MPI ranks + 6 GPUs + CUDA-aware SF + No log_view
MatMult:             100 1.0 1.0397e-01    (manually computed Total Mflop/s: 544510)







