I'm wondering about the details of the Bcast implementation in OpenMPI. I'm specifically interested in InfiniBand (IB) interconnects, but information about other architectures (and OpenMPI in general) would also be very useful.

I am working with a code that sends the same (large) message to a bunch of 'neighboring' processes. It is somewhat like a ghost-zone exchange, except that the message is the same for all neighbors. Since memory bandwidth is a scarce resource, I'd like to make sure we send the message with the fewest possible memory accesses.

Hence the question: what does OpenMPI (and, specifically for the IB case, HPCX) do in such a case? Does it read the buffer from memory O(1) times to send it to n peers, with the broadcast orchestrated by the hardware? Or does it have to read the memory O(n) times? Is it more efficient to use Bcast, or is it the same as implementing the operation with n distinct send / put operations? Finally, is there any way to use the RMA put method with multiple targets, so that I only have to read the host memory once, and the switches / HCA take care of the rest?
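
To make the O(1) vs O(n) question concrete, here is a rough model of how many times the root would have to read its send buffer under three schemes. This is only a sketch under my own assumptions: the function names are made up, the binomial tree is just one common software broadcast algorithm (I don't know which one OpenMPI or HPCX actually selects), and the "hardware multicast" case is the hypothetical ideal I am asking about, not something I know to be available.

```python
def root_reads_linear(n_peers: int) -> int:
    """n distinct sends/puts: the root reads the buffer once per peer."""
    return n_peers


def root_reads_binomial(n_peers: int) -> int:
    """Software binomial-tree broadcast over the root plus n_peers ranks.

    The root sends in every round, and the number of rounds is
    ceil(log2(ranks)), computed here with integer arithmetic.
    """
    ranks = n_peers + 1
    return (ranks - 1).bit_length()


def root_reads_hw_multicast(n_peers: int) -> int:
    """Hypothetical ideal: the fabric replicates the message, one read total."""
    return 1 if n_peers > 0 else 0


if __name__ == "__main__":
    # Compare the three schemes for a few neighbor counts.
    for n in (1, 3, 7, 8, 26):
        print(n, root_reads_linear(n), root_reads_binomial(n),
              root_reads_hw_multicast(n))
```

The model ignores the extra reads on intermediate ranks that forward the message in the tree case, as well as any segmentation or pipelining of large messages. It only counts traffic on the root's memory.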

Thanks a lot for any insights!

