Hi,
we're starting to explore (with Andreas cc'd) residual assembly on GPUs. The question naturally arises: how to do GlobalToLocal and LocalToGlobal.

I have:
- A PetscSF describing the communication pattern.
- A Vec holding the data to communicate. This will have an up-to-date device pointer.

I would like:
- PetscSFBcastBegin/End (and ReduceBegin/End, etc...) to (optionally) work with raw device pointers.

I am led to believe that modern MPIs can plug directly into device memory, so I would like to avoid copying data to the host, doing the communication there, and then going back up to the device.
I don't know how far the CUDA software stack has advanced recently, but usually you want to do your best to avoid any latency hits due to PCI Express. That is, packing the ghost data you want to communicate (as described by the SF) on the GPU, sending the packed data over, then unpacking on the host (note: here one could further optimize if needed) will most likely be much better, both in terms of latency and in making efficient use of the limited PCI-Express bandwidth, than what Unified Memory approaches can provide.
If you want to use OpenCL, you'll have to do the above anyway.
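A minimal sketch of the pack/unpack step described above, written in plain C so it compiles without a GPU toolchain. The function names and the flat index list are assumptions for illustration, not PetscSF internals; on a device each loop body would become one thread of a kernel, with the index list derived from the SF.

```c
#include <assert.h>
#include <stddef.h>

/* Gather the entries listed in idx[] into a contiguous send buffer.
 * Hypothetical helper: on a GPU, iteration i would be handled by thread i. */
void pack_ghosts(const double *data, const int *idx, size_t n, double *buf)
{
    assert(data && idx && buf);
    for (size_t i = 0; i < n; ++i) buf[i] = data[idx[i]];
}

/* Scatter received values back, overwriting the targets (broadcast-style). */
void unpack_insert(double *data, const int *idx, size_t n, const double *buf)
{
    assert(data && idx && buf);
    for (size_t i = 0; i < n; ++i) data[idx[i]] = buf[i];
}

/* Scatter received values back, summing into the targets (reduce-style).
 * Note: a parallel device version needs atomic adds if idx[] can contain
 * duplicates; this serial loop does not have that problem. */
void unpack_add(double *data, const int *idx, size_t n, const double *buf)
{
    assert(data && idx && buf);
    for (size_t i = 0; i < n; ++i) data[idx[i]] += buf[i];
}
```

Only the contiguous buffer of n entries has to cross PCI Express (or be handed to MPI), rather than the whole vector.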
Given that the window implementation (which just delegates to MPI for all the packing) is not considered ready for prime time (mostly due to MPI implementation bugs, I think), this means implementing a version of PetscSF_Basic that can handle the pack/unpack directly on the device, and then just hands off to MPI.

The next thing is how to put a higher-level interface on top of this. What, if any, suggestions are there for doing something where the top-level API is agnostic to whether the data are on the host or the device? We had thought something like:
- Make PetscSF handle device pointers (possibly with a new implementation?)
- Make VecScatter use SF.

Calling VecScatterBegin/End on a Vec with up-to-date device pointers just uses the SF directly.
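One way such a host/device-agnostic entry point could look, as a rough sketch only: a single pack routine that dispatches on where the data lives. Every name here (MemType, sketch_pack, the gather functions) is hypothetical and not part of PETSc, and the "device" path is a host stand-in so the example compiles without CUDA.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tag recording where a buffer lives. */
typedef enum { MEM_HOST = 0, MEM_DEVICE = 1 } MemType;

typedef void (*PackFn)(const double *data, const int *idx, size_t n, double *buf);

/* Host implementation: plain indexed gather. */
void gather_host(const double *data, const int *idx, size_t n, double *buf)
{
    for (size_t i = 0; i < n; ++i) buf[i] = data[idx[i]];
}

/* Stand-in for a kernel launch. In a real build this would launch a device
 * kernel on device pointers; here it runs on the host so the sketch is
 * runnable without a GPU. */
void gather_device_standin(const double *data, const int *idx, size_t n, double *buf)
{
    for (size_t i = 0; i < n; ++i) buf[i] = data[idx[i]];
}

/* Dispatch table indexed by MemType. */
static const PackFn pack_table[2] = { gather_host, gather_device_standin };

/* Single top-level entry point: the caller states where the data is,
 * and the matching implementation is selected. */
void sketch_pack(MemType where, const double *data, const int *idx, size_t n, double *buf)
{
    assert(where == MEM_HOST || where == MEM_DEVICE);
    assert(data && idx && buf);
    pack_table[where](data, idx, n, buf);
}
```

With this shape, a caller such as a VecScatter-like wrapper would only need to know which copy of the Vec is up to date and pass the corresponding tag.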
There are already optimizations for VecScatter available when using CUDA. I'm happy to help you adapt those to SF within the next week if needed.
Have there been any thoughts about how you want to do multi-GPU interaction?
Just use MPI with one GPU per MPI rank :-)

Best regards,
Karli
