Hi Freddie,

Freddie Witherden <[email protected]> writes:
> I have finally bitten the bullet and have started porting my solver from
> CUDA to OpenCL. During a time-step it is necessary for MPI ranks to
> exchange data. With PyCUDA and mpi4py our application proceeds as follows:
>
> At start-up we allocate a page-locked buffer on the host and an
> equally-sized buffer on the device. We also construct a persistent MPI
> request for sending the host buffer. Then, when the time is right, we run
> a packing kernel on the device, initiate a device-to-host copy, and then
> start the persistent MPI request.
>
> Does anyone have any experience with performing this with OpenCL? From
> what I can gather there are a variety of options, although none of them
> jumps off the page. I am wary of having the device use a memory-mapped
> host pointer (when I tried it with CUDA our performance tanked). I also
> cannot find a direct equivalent to pagelocked_empty in OpenCL.
> ALLOC_HOST_PTR followed by an enqueue_map_buffer may be what I want, but
> I am unsure whether it fits in with persistent requests (it would need to
> be mapped all of the time).
ALLOC_HOST_PTR with enqueue_map_buffer will give you page-locked memory
(and thus fast transfers) on Nvidia and AMD. You very likely do *not* want
to pass this buffer to any kernels, though--just use it as a transfer
target.

Hope that helps,
Andreas
_______________________________________________
PyOpenCL mailing list
[email protected]
http://lists.tiker.net/listinfo/pyopencl
