Re: [petsc-dev] PetscSF and/or VecScatter with device pointers

Karl Rupp Sat, 14 Jul 2018 15:08:46 -0700

Hi,

we're starting to explore (with Andreas cc'd) residual assembly on
GPUs.  The question naturally arises: how to do GlobalToLocal and
LocalToGlobal.

I have:

A PetscSF describing the communication pattern.

A Vec holding the data to communicate.  This will have an up-to-date
device pointer.

I would like:

PetscSFBcastBegin/End (and ReduceBegin/End, etc...) to (optionally)
work with raw device pointers.  I am led to believe that modern MPIs
can plug directly into device memory, so I would like to avoid copying
data to the host, doing the communication there, and then going back
up to the device.

I don't know how the CUDA software stack has advanced recently, but
usually you want to try your best at avoiding any latency hits due to
PCI Express. That is, packing the ghost data you want to communicate (as
described by the SF) on the GPU, sending the packed data over, then
unpacking on the host (note: here one could further optimize if needed)
will most likely be much better in terms of latency and efficient use of
low PCI-Express bandwidth than what Unified Memory approaches can provide.
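To make the pack/send/unpack idea concrete, here is a minimal sketch in
plain C. The function names (pack_ghosts, unpack_insert, unpack_add) and
the flat index-list layout are assumptions for illustration only; they are
not PetscSF or VecScatter API. On a GPU each loop would presumably become
a single kernel with one thread per entry, so that only the small
contiguous buffer has to cross PCI Express or be handed to MPI.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: gather the entries named by idx[] into a
   contiguous send buffer. On a device this loop would be one kernel. */
static void pack_ghosts(size_t n, const int *idx, const double *local, double *buf)
{
  assert(n == 0 || (idx && local && buf));
  for (size_t i = 0; i < n; ++i) buf[i] = local[idx[i]];
}

/* Hypothetical helper: scatter a received buffer back into place,
   overwriting the target entries (insert semantics). */
static void unpack_insert(size_t n, const int *idx, const double *buf, double *local)
{
  assert(n == 0 || (idx && local && buf));
  for (size_t i = 0; i < n; ++i) local[idx[i]] = buf[i];
}

/* Hypothetical helper: same as above, but accumulating into the target
   entries (add semantics, as needed for a LocalToGlobal with summation).
   Assumes idx[] has no duplicates; a parallel device version would need
   atomics otherwise. */
static void unpack_add(size_t n, const int *idx, const double *buf, double *local)
{
  assert(n == 0 || (idx && local && buf));
  for (size_t i = 0; i < n; ++i) local[idx[i]] += buf[i];
}
```

The packed buffer is what would be passed to MPI_Isend/MPI_Irecv; whether
it can stay in device memory depends on the MPI build being CUDA-aware.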


If you want to use OpenCL, you'll have to do the above anyway.

Given that I think that the window implementation (which just
delegates to MPI for all the packing) is not considered prime time
(mostly due to MPI implementation bugs, I think), I think this means
implementing a version of PetscSF_Basic that can handle the
pack/unpack directly on the device, and then just hands off to MPI.

The next thing is how to put a higher-level interface on top of this.
What, if any, suggestions are there for doing something where the
top-level API is agnostic to whether the data are on the host or the
device?

We had thought something like:

- Make PetscSF handle device pointers (possibly with new implementation?)

- Make VecScatter use SF.

Calling VecScatterBegin/End on a Vec with up-to-date device pointers
just uses the SF directly.

There are already optimizations available for VecScatter when using
CUDA. I'm happy to help you with tweaking that to SF within the next
week if needed.

Have there been any thoughts about how you want to do multi-GPU
interaction?


Just use MPI with one GPU per MPI rank :-)

Best regards,
Karli
