> On 12 Jul 2018, at 22:08, Jed Brown <[email protected]> wrote:
>
> ...
>>> I have:
>>>
>>> A PetscSF describing the communication pattern.
>>>
>>> A Vec holding the data to communicate.  This will have an up-to-date
>>> device pointer.
>>>
>>> I would like:
>>>
>>> PetscSFBcastBegin/End (and ReduceBegin/End, etc...) to (optionally)
>>> work with raw device pointers.  I am led to believe that modern MPIs
>>> can plug directly into device memory, so I would like to avoid copying
>>> data to the host, doing the communication there, and then going back
>>> up to the device.
>>>
>>> Given that I think that the window implementation (which just
>>> delegates the MPI for all the packing) is not considered prime time
>>> (mostly due to MPI implementation bugs, I think), I think this means
>>> implementing a version of PetscSF_Basic that can handle the
>>> pack/unpack directly on the device, and then just hands off to MPI.
>>>
>>
>> I think that is the case.
>
> I doubt GPU Direct can give high performance for the derived data types
> that the SF Window implementation uses (if it works at all).

MVAPICH claims to support datatypes with GPUDirect (including
non-contiguous), and one-sided DMA.  But I'm willing to believe that
this is all lies.

>>> The next thing is how to put a higher-level interface on top of this.
>>> What, if any, suggestions are there for doing something where the
>>> top-level API is agnostic to whether the data are on the host or the
>>> device.
>>>
>>> We had thought something like:
>>>
>>> - Make PetscSF handle device pointers (possibly with new implementation?)
>>>
>>> - Make VecScatter use SF.
>> Yep, this is what I would do.
>
> Agreed.

OK.  We'll have a look at getting this done.

>>> Calling VecScatterBegin/End on a Vec with up-to-date device pointers
>>> just uses the SF directly.
>>>
>>> Have there been any thoughts about how you want to do multi-GPU
>>> interaction?
>
> With MPI-parallel code, I don't see a compelling reason to support
> multiple devices per MPI process.

Miscommunication: by multi-GPU, I mean one device per MPI process.  I
just meant, if there is existing PETSc effort going towards supporting
computation on device, are there thoughts above and beyond what I just
described on how you want to hide device-device transfers behind the
API?

Cheers,

Lawrence
