Hi John,

I read your article, and indeed this seems to be very close to what I
am looking for. This `rma_object<>` will certainly prove very useful in
implementing zero-copy data transport.

However, in the context of heterogeneous clusters, it might be
necessary to use different allocators at both ends, since CUDA does
not seem to use regular page-locked memory [1]. But this could in
principle be done by the user by implementing some "heterogeneous
pinned allocator", which adjusts its behaviour depending on whether the
node has GPUs or not.

For my use case, I do not even plan to schedule actions directly on GPU
targets, but only on their host, which would then be responsible for
launching the kernels. Only the RMA from the sender to the remote
device memory would need to be handled by HPX.

Anyway, thanks a lot for your reply, and keep up the good work! It
might be a bit too late for my current project, but it will certainly be
of great interest to my coworkers. In the meantime, I will try to
implement Hartmut's suggestion of using `serialize_buffer` to limit the
number of copies.
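In case it helps anyone following the thread, this is what I understand the `serialize_buffer` approach to look like (a sketch from memory against the HPX API, so please check the headers for the exact spelling):

```cpp
#include <hpx/include/serialization.hpp>
#include <vector>

// Wrap an existing array without copying it: with the `reference`
// init mode the buffer only points at the data, so the vector must
// outlive the send.
std::vector<double> data(1024);
hpx::serialization::serialize_buffer<double> buf(
    data.data(), data.size(),
    hpx::serialization::serialize_buffer<double>::reference);
// buf can then be passed as an action argument, avoiding the
// element-by-element copy of serializing the vector itself.
```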

BTW, are you sure about the units in fig. 3? Four seconds to serialize
20 KB of data is not especially High Performance... unless of course
you were running HPX on a toaster :-)


[1] https://stackoverflow.com/questions/26888890/cuda-and-pinned-page-l
hpx-users mailing list
