Matthew Knepley <[email protected]> writes:

> On Thu, Jul 12, 2018 at 6:47 AM Lawrence Mitchell <
> [email protected]> wrote:
>
>> Dear petsc-dev,
>>
>> we're starting to explore (with Andreas cc'd) residual assembly on
>> GPUs.  The question naturally arises: how to do GlobalToLocal and
>> LocalToGlobal.
>>
>
> There is not a lot of Mem Band difference between a GPU and a Skylake,
> but I assume this is to use hardware already purchased by some center.

Skylake Xeon is around 100 GB/s per socket, versus a V100 at about
750 GB/s.  That's nothing to sneeze at, but moving the entire vector to
the host just to pack messages is a much bigger hit for large subdomains
because the entire volume needs to move over PCI-Express.

>> I have:
>>
>> A PetscSF describing the communication pattern.
>>
>> A Vec holding the data to communicate.  This will have an up-to-date
>> device pointer.
>>
>> I would like:
>>
>> PetscSFBcastBegin/End (and ReduceBegin/End, etc...) to (optionally)
>> work with raw device pointers.  I am led to believe that modern MPIs
>> can plug directly into device memory, so I would like to avoid copying
>> data to the host, doing the communication there, and then going back
>> up to the device.
>>
>> Given that I think that the window implementation (which just
>> delegates to MPI for all the packing) is not considered prime time
>> (mostly due to MPI implementation bugs, I think), I think this means
>> implementing a version of PetscSF_Basic that can handle the
>> pack/unpack directly on the device, and then just hands off to MPI.
>>
>
> I think that is the case.

I doubt GPU Direct can give high performance for the derived data types
that the SF Window implementation uses (if it works at all).

>> The next thing is how to put a higher-level interface on top of this.
>> What, if any, suggestions are there for doing something where the
>> top-level API is agnostic to whether the data are on the host or the
>> device.
>>
>> We had thought something like:
>>
>> - Make PetscSF handle device pointers (possibly with new implementation?)
>>
>> - Make VecScatter use SF.
>>
>
> Yep, this is what I would do.

Agreed.

>> Calling VecScatterBegin/End on a Vec with up-to-date device pointers
>> just uses the SF directly.
>>
>> Have there been any thoughts about how you want to do multi-GPU
>> interaction?

With MPI-parallel code, I don't see a compelling reason to support
multiple devices per MPI process.

> I don't think so, but Karl could reply if there has been.
>
> How are you doing local assembly?
>
>   Matt
>
>
>> Cheers,
>>
>> Lawrence
>>
>
>
> --
> What most experimenters take for granted before they begin their
> experiments is infinitely more interesting than any results to which their
> experiments lead.
> -- Norbert Wiener
>
> https://www.cse.buffalo.edu/~knepley/ <http://www.caam.rice.edu/~mk51/>
