I'm wondering about the details of the Bcast implementation in OpenMPI.
I'm specifically interested in IB interconnects, but information about
other architectures (and OpenMPI in general) would also be very useful.

I am working with a code that sends the same (large) message to a
bunch of 'neighboring' processes. It is somewhat like a ghost-zone
exchange, but the message is the same for all neighbors. Since memory
bandwidth is a scarce resource, I'd like to make sure we send the
message with the fewest possible memory accesses.

Hence the question: what does OpenMPI (and, for the IB case
specifically, HPCX) do in such a case? Does it read the buffer from
memory O(1) times to send it to n peers, with the broadcast orchestrated
by the hardware? Or does it have to read the memory O(n) times? Is it
more efficient to use Bcast, or is it the same as implementing the
operation with n distinct send / put operations? Finally, is there any
way to use the RMA put method with multiple targets, so that I only have
to read the host memory once, and the switches / HCA take care of the
rest?
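
To make the O(1) versus O(n) question concrete, here is a toy Python
count of how often the source rank would touch its send buffer. It is
only a sketch: it assumes a textbook binomial-tree broadcast, it is not
a claim about what OpenMPI or HPCX actually do, and the function names
are made up for illustration.

```python
def root_reads_point_to_point(n_peers: int) -> int:
    """n distinct sends/puts: the source buffer is read once per peer."""
    return n_peers


def root_reads_binomial_tree(n_peers: int) -> int:
    """Sends issued by the root in a textbook binomial-tree broadcast.

    Assumption: in every round, each rank that already holds the message
    forwards it to one rank that does not, so the number of holders
    doubles per round and the root sends once per round.
    """
    total = n_peers + 1   # the root plus its peers
    holders = 1           # only the root holds the message at the start
    root_sends = 0
    while holders < total:
        root_sends += 1
        holders = min(total, 2 * holders)
    return root_sends


if __name__ == "__main__":
    for n in (1, 4, 8, 26):
        print(n, root_reads_point_to_point(n), root_reads_binomial_tree(n))
```

In this model the tree variant reads the source buffer about log2(n)
times instead of n times. A true hardware multicast would be the O(1)
case I am asking about.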

Thanks a lot for any insights!
devel mailing list