Re: [petsc-users] MemCpy (HtoD and DtoH) in Krylov solver

Karl Rupp via petsc-users Fri, 19 Jul 2019 09:08:22 -0700

Hi Xiangdong,

I can understand some of the numbers, but not the HtoD case.
In DtoH1, it is the data movement from VecMDot. The size of data is 8.192KB, which is sizeof(PetscScalar) * MDOT_WORKGROUP_NUM * 8 = 8*128*8 = 8192. My question is: instead of calling cublasDdot nv times, why do you implement your own kernels? I guess it must be for performance, but can you explain a little more?

Yes, this is a performance optimization. We've used several dot-products (which suffer from kernel launch latency) as well as matrix-vector-products (which suffer from extra matrix setup) in the past; in both cases, there was extra memory traffic, thus impacting performance.

The reason why the data size is 8192 is to get around a separate reduction stage on the GPU (i.e. a second kernel launch). By moving the data to the CPU and doing the reduction there, one is faster than doing it on the GPU and then moving only a few numbers. This has to do with PCI-Express latency: It takes about the same time to send a single byte as sending a few kilobytes. Only beyond ~10 KB the bandwidth becomes the limiting factor.
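
The host-side half of that scheme can be sketched in plain C. This is only an illustration under stated assumptions: the function name and the buffer layout (one partial sum per workgroup per vector, stored as partials[j * ngroups + g]) are hypothetical and not taken from the PETSc sources. In a real run the buffer would first be copied from the device, which is the 8192-byte DtoH transfer for 8 vectors and 128 workgroups.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: finish nv dot products on the host.
 * partials holds nv * ngroups values, ASSUMED to be laid out as
 * partials[j * ngroups + g] = partial sum for vector j from workgroup g.
 * The layout is an assumption made for this sketch only. */
void reduce_partials_on_host(const double *partials, size_t nv,
                             size_t ngroups, double *results)
{
    assert(partials != NULL && results != NULL);
    for (size_t j = 0; j < nv; ++j) {
        double sum = 0.0;
        /* Sum the per-workgroup partial results for vector j. */
        for (size_t g = 0; g < ngroups; ++g)
            sum += partials[j * ngroups + g];
        results[j] = sum;
    }
}
```

With nv = 8 and ngroups = 128 this loop adds up 1024 doubles, which is negligible work for the CPU compared with launching a second reduction kernel.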

In DtoH2, it is the data movement from VecNorm. The size of data is 8B, which is just the sizeof(PetscScalar).


This is most likely the result required for the control flow on the CPU.

In DtoD1, it is the data movement from VecAXPY. The size of data is 17.952MB, which is exactly sizeof(PetscScalar)*length(b).

This is a vector assignment. If I remember correctly, it uses the memcpy-routines and hence shows up as a separate DtoD instead of just a kernel. It matches the time required for scal_kernel_val (scaling a vector by a scalar), so it runs at full bandwidth on the GPU.

However, I do not understand the number in HostToDevice in gmres for np=1. The size of data movement is 1.032KB. I thought this is related to the updated upper Hessenberg matrix, but the number does not match. Can anyone help me understand the data movement of HToD in GMRES for np=1?

1032 = (128+1)*8, so this might be some auxiliary work information on the GPU. I could figure out the exact source of these transfers, but that is some effort. Let me know whether this is important information for you, then I can do it.
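
The transfer sizes quoted in this thread can be checked with a few lines of arithmetic. The sketch below assumes that sizeof(PetscScalar) is 8 (consistent with the 8B VecNorm transfer) and that the profiler reports decimal KB and MB (consistent with 8.192KB being 8192 bytes). The function names are made up for illustration.

```c
#include <stddef.h>

/* Bytes moved when a batched dot product returns one partial result
 * per workgroup per vector (8 * 128 * 8 = 8192 in this thread). */
size_t mdot_partials_bytes(size_t scalar_size, size_t ngroups, size_t nv)
{
    return scalar_size * ngroups * nv;
}

/* Number of scalars in a transfer of the given size, or 0 if the size
 * is not a whole multiple of the scalar size. */
size_t scalars_in_transfer(size_t bytes, size_t scalar_size)
{
    if (scalar_size == 0 || bytes % scalar_size != 0)
        return 0;
    return bytes / scalar_size;
}
```

Under those assumptions, scalars_in_transfer(1032, 8) gives 129 = 128 + 1 scalars for the HtoD copy, and scalars_in_transfer(17952000, 8) gives 2244000, which would be the length of b implied by the 17.952MB DtoD copy.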


Best regards,
Karli


Thank you.

Best,
Xiangdong

On Thu, Jul 18, 2019 at 1:14 PM Karl Rupp wrote:


    Hi,

    as you can see from the screenshot, the communication is merely for
    scalars from the dot-products and/or norms. These are needed on the
    host for the control flow and convergence checks; this is true for
    any iterative solver.

    Best regards,
    Karli
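
Why an iterative solver needs those scalars on the host can be seen in a toy example. The sketch below is not GMRES and not PETSc code: it is a Richardson iteration on a diagonal system, written in plain C, chosen only because it is short. Comments mark which steps would be device kernels in a GPU build and where a single scalar would have to reach the host. All names are hypothetical.

```c
#include <stddef.h>

/* Squared 2-norm of the residual b - D x for a diagonal D.
 * Stand-in for a device-side norm: on a GPU this would be a kernel
 * followed by a small device-to-host copy of the scalar result. */
double residual_norm_sq(const double *d, const double *x,
                        const double *b, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double r = b[i] - d[i] * x[i];
        sum += r * r;
    }
    return sum;
}

/* Richardson iteration x <- x + omega * (b - D x).
 * Stops once ||r|| <= rtol * ||r0|| (compared in squared form) or
 * after max_it updates. Returns the number of updates performed. */
int richardson_diag(const double *d, const double *b, double *x, size_t n,
                    double omega, double rtol, int max_it)
{
    double r0_sq = residual_norm_sq(d, x, b, n);
    int it = 0;
    while (it < max_it) {
        /* This scalar must be available on the host ... */
        double r_sq = residual_norm_sq(d, x, b, n);
        /* ... because the decision to stop is host-side control flow. */
        if (r_sq <= rtol * rtol * r0_sq)
            break;
        /* The vector update itself could stay entirely on the device. */
        for (size_t i = 0; i < n; ++i)
            x[i] += omega * (b[i] - d[i] * x[i]);
        ++it;
    }
    return it;
}
```

Only the scalar used in the stopping test crosses to the host each iteration; the vectors d, b and x never need to.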



    On 7/18/19 3:11 PM, Xiangdong via petsc-users wrote:
     >
     >
     > On Thu, Jul 18, 2019 at 5:11 AM Smith, Barry F. wrote:
     >
     >
     >         1) What preconditioner are you using? If any.
     >
     > Currently I am using none, as I want to understand how GMRES works
     > on the GPU.
     >
     >
     >         2) Where/how are you getting this information about the
     >     MemCpy(HtoD) and one call MemCpy(DtoH)? We might like to utilize
     >     this same sort of information to plan future optimizations.
     >
     > I am using nvprof and nvvp from the CUDA toolkit. It looks like there
     > are one MemCpy(HtoD) and three MemCpy(DtoH) calls per iteration for
     > the np=1 case. See the attached snapshots.
     >
     >         3) Are you using more than 1 MPI rank?
     >
     >
     > I tried both np=1 and np=2. Attached please find snapshots from nvvp
     > for both the np=1 and np=2 cases. The figures show GPU calls for two
     > pure GMRES iterations.
     >
     > Thanks.
     > Xiangdong
     >
     >
     >        If you use the master branch (which we highly recommend for
     >     anyone using GPUs and PETSc) the -log_view option will log
     >     communication between CPU and GPU and display it in the summary
     >     table. This is useful for seeing exactly what operations are
     >     doing vector communication between the CPU/GPU.
     >
     >        We welcome all feedback on the GPU support since it has
     >     previously only been lightly used.
     >
     >         Barry
     >
     >
     >      > On Jul 16, 2019, at 9:05 PM, Xiangdong via petsc-users wrote:
     >      >
     >      > Hello everyone,
     >      >
     >      > I am new to petsc gpu and have a simple question.
     >      >
     >      > When I tried to solve Ax=b, where A is MATAIJCUSPARSE and b and x
     >      > are VECSEQCUDA, with GMRES (or GCR) and pcnone, I found that during
     >      > each Krylov iteration there is one MemCpy(HtoD) call and one
     >      > MemCpy(DtoH) call. Does that mean the Krylov solve is not 100% on
     >      > the GPU and still needs some work from the CPU? What are these
     >      > MemCpys for during each iteration?
     >      >
     >      > Thank you.
     >      >
     >      > Best,
     >      > Xiangdong
     >
