Hi John,

I read your article, and indeed this seems to be very close to what I
am looking for. This `rma_object<>` will certainly prove very useful in
implementing zero-copy data transport.

However, in the context of heterogeneous clusters, it might be
necessary to use different allocators at both ends, since CUDA does
not seem to use regular page-locked memory [1]. But this could in
principle be done by the user by implementing some "heterogeneous
pinned allocator", which adjusts its behaviour depending on whether the
node has GPUs or not.

For my use case, I do not even plan to schedule actions directly on GPU
targets, but only on their host, which would then be responsible for
launching the kernels. Only the RMA from the sender to the remote
device memory would need to be handled by HPX.

Anyway, thanks a lot for your reply, and keep up the good work! It
might be a bit too late for my current project, but it will certainly be
of great interest to my coworkers. In the meantime, I will try to
implement Hartmut's suggestion of using `serialize_buffer` to limit the
number of copies.
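In case it helps anyone following the thread, this is what I understand the `serialize_buffer` approach to look like (a sketch from memory against the HPX API, so please check the headers for the exact spelling):

```cpp
#include <hpx/include/serialization.hpp>
#include <vector>

// Wrap an existing array without copying it: with the `reference`
// init mode the buffer only points at the data, so the vector must
// outlive the send.
std::vector<double> data(1024);
hpx::serialization::serialize_buffer<double> buf(
    data.data(), data.size(),
    hpx::serialization::serialize_buffer<double>::reference);
// buf can then be passed as an action argument, avoiding the
// element-by-element copy of serializing the vector itself.
```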

BTW, are you sure about the units in fig. 3? Four seconds to serialize
20 KB of data is not especially High Performance... unless of course
you were running HPX on a toaster :-)


[1] https://stackoverflow.com/questions/26888890/cuda-and-pinned-page-l
hpx-users mailing list
