Jean-Loup,

This is something I've been working towards, but unfortunately the code is not quite ready for you yet. I have begun work on rma_object<T> types that use a custom allocator to provide pinned memory, and so far I have integrated it into the libfabric parcelport, though it is not finished yet. The current status is described in this paper: ftp://ftp.cscs.ch/out/biddisco/hpx/Applied-Computing-HPX-ZeroCopy.pdf - I think when you see the description, you'll want to use it for your gpu data.
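To give a feel for what the allocator side of this looks like, here is a minimal, purely illustrative sketch (not the actual rma_object<T> code - the allocator name and flags below are my own choices): a std-style allocator that hands out pinned host memory via the cuda runtime, so anything built on top of it is eligible for asynchronous DMA transfers.

    // Illustrative sketch only - not the rma_object<T> implementation.
    #include <cuda_runtime.h>
    #include <cstddef>
    #include <new>
    #include <vector>

    template <typename T>
    struct pinned_host_allocator
    {
        using value_type = T;

        pinned_host_allocator() = default;
        template <typename U>
        pinned_host_allocator(pinned_host_allocator<U> const&) {}

        T* allocate(std::size_t n)
        {
            void* p = nullptr;
            // cudaHostAllocDefault gives plain page-locked memory; other
            // flags (e.g. cudaHostAllocMapped) may suit other use cases.
            if (cudaHostAlloc(&p, n * sizeof(T), cudaHostAllocDefault) != cudaSuccess)
                throw std::bad_alloc();
            return static_cast<T*>(p);
        }

        void deallocate(T* p, std::size_t)
        {
            cudaFreeHost(p);
        }
    };

    template <typename T, typename U>
    bool operator==(pinned_host_allocator<T> const&, pinned_host_allocator<U> const&) { return true; }
    template <typename T, typename U>
    bool operator!=(pinned_host_allocator<T> const&, pinned_host_allocator<U> const&) { return false; }

    // For example, a vector whose storage is pinned:
    using pinned_vector = std::vector<float, pinned_host_allocator<float>>;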
In principle, it ought to be fairly straightforward to make it work, but in practice it will require quite a bit of poking around in the hpx internals to get it going. If you are not desperate and can wait a few months, I will be resuming this work in September with extensions to the rma_object types so that you can perform put/get operations on remote nodes with them directly, rather than invoking a 'copy' action. For the gpu, we would map the rma.put/get onto a cuda copy operation using the pinned memory of the underlying object (a rough sketch of that kind of copy is appended at the end of this mail, after the quoted message). Overloading the rma_object types to use a different allocator - taken from the gpu - would actually be fairly easy, but I have not looked at the action handling for gpu targets, so I'd need to ponder that.

Basically, the answer to your question is: there will be a way to do what you want, but not yet.

JB

________________________________________
From: [email protected] [[email protected]] on behalf of Jean-Loup Tastet [[email protected]]
Sent: 07 August 2017 14:54
To: [email protected]
Cc: Felice Pantaleo
Subject: [hpx-users] Receiving action arguments on pinned memory

Hi all,

I am currently trying to use HPX to offload computationally intensive tasks to remote GPU nodes. In idiomatic HPX, this would typically be done by invoking a remote action:

    OutputData compute(InputData input_data)
    {
        /* Asynchronously copy `input_data` to device using DMA */
        /* Do work on GPU */
        /* Copy back the results to host */
        return results;
    }
    HPX_PLAIN_ACTION(compute, compute_action);

    // In sender code
    auto fut = hpx::async(compute_action(), remote_locality_with_gpu,
                          std::move(input_data));

So far, so good. However, an important requirement is that the memory allocated for the input data on the receiver end be pinned, to enable asynchronous copies between the host and the GPU. This can of course always be done by copying the argument `input_data` to pinned memory within the function body, but I would prefer to avoid any superfluous copy in order to minimize the overhead.

Do you know if it is possible to control, within HPX, where the memory for the input data will be allocated on the receiver end? I tried to use the `pinned_allocator` from the Thrust library for the data members of `InputData`, and although it did its job as expected, it also requires allocating pinned memory on the sender side (for the construction of the object), as well as the presence of the Thrust library and the CUDA runtime on both machines. This led me to think that there should be a better way.

Ideally, I would be able to directly deserialize the incoming data into pinned memory. Do you know if there is a way to do this or something similar in HPX? If not, do you think it is possible to emulate such functionality by directly using the low-level constructs / internals of HPX? This is for a prototype, so it is okay to use unstable / undocumented code as long as it allows me to prove the feasibility of the approach.

I would greatly appreciate any input / suggestions on how to approach this issue. If anyone has experience using HPX with GPUs or on heterogeneous clusters, I would be very interested in hearing about it as well.

Best regards,
Jean-Loup Tastet

_______________________________________________
hpx-users mailing list
[email protected]
https://mail.cct.lsu.edu/mailman/listinfo/hpx-users
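The sketch referred to above follows - again purely illustrative, not HPX or parcelport API. The function names are invented, and it assumes the host buffer already lives in pinned memory (e.g. allocated as in the earlier allocator sketch).

    // Illustrative only: the async cuda copies that an rma.put / rma.get
    // against a gpu target would map onto. Genuine overlap with other work
    // only happens when the host side is page-locked (pinned); with
    // pageable memory cudaMemcpyAsync degrades to a staged, blocking copy.
    #include <cuda_runtime.h>
    #include <cstddef>

    cudaError_t async_put_to_device(void* device_dst, void const* pinned_src,
                                    std::size_t size, cudaStream_t stream)
    {
        return cudaMemcpyAsync(device_dst, pinned_src, size,
                               cudaMemcpyHostToDevice, stream);
    }

    cudaError_t async_get_from_device(void* pinned_dst, void const* device_src,
                                      std::size_t size, cudaStream_t stream)
    {
        return cudaMemcpyAsync(pinned_dst, device_src, size,
                               cudaMemcpyDeviceToHost, stream);
    }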
