Marcin,
Based on your description, you might want to investigate non-blocking
collectives (e.g. MPI_Ibcast) or even the upcoming persistent
collectives (e.g. MPIX_Bcast_init).
If you know the address of the receive buffer, you can post
MPI_Ibcast() on the non-root ranks very early, and then MPI_Test() or
MPI_Wait() for completion. On the root rank, you post MPI_Ibcast()
(followed by MPI_Wait()) once the data to be sent is available in the
buffer.
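A minimal sketch of that pattern (buffer size, datatype and root are
picked arbitrarily for illustration; error checking omitted):

    #include <mpi.h>
    #include <string.h>

    #define N    4096
    #define ROOT 0

    int main(int argc, char **argv)
    {
        char        buf[N];
        int         rank;
        MPI_Request req;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank != ROOT) {
            /* Non-root ranks post the receive side as early as
             * possible; only the buffer address and size need to be
             * known up front. */
            MPI_Ibcast(buf, N, MPI_BYTE, ROOT, MPI_COMM_WORLD, &req);
            /* ... overlap unrelated computation here, optionally
             * calling MPI_Test() on req now and then to drive
             * progress ... */
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        } else {
            memset(buf, 42, N);  /* stand-in for producing the data */
            /* The root posts the bcast only once the data is ready. */
            MPI_Ibcast(buf, N, MPI_BYTE, ROOT, MPI_COMM_WORLD, &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }

The earlier the non-root ranks post the MPI_Ibcast(), the more of the
propagation can overlap with their computation.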
Cheers,
Gilles
On 3/22/2019 5:14 PM, marcin.krotkiewski wrote:
On 3/21/19 5:31 PM, George Bosilca wrote:
I am not sure I understand your question. A bcast is a collective
operation that must be posted by all participants; regardless of the
level at which the bcast is serviced, if some of the participants have
not yet posted their participation in the collective, only partial
progress can be made.
Of course you're right, and all ranks will post the Bcast eventually,
but not necessarily at the same time. My concern is that, with my
understanding of the tree algorithm, if a non-leaf tree node is busy
(e.g., with computations) at the time the node upstream tries to send
it a message, it will take time before it can propagate the message
downstream to its sub-tree. During that time the sub-tree nodes may be
done computing and sit idle, waiting for the Bcasted message. So in a
realistic scenario, in which not all nodes enter the Bcast at the same
time, a tree algorithm might not be optimal. Am I wrong?
I guess this problem could be mitigated by a 'smart' algorithm, which
would identify such stalls and re-send the message to the sub-tree in
question from another node. However, it seems that in a pessimistic
scenario this degrades to a simple one-sends-to-all implementation.
And my main reason for asking is to minimize the memory bandwidth
overhead connected with sending the message.
Marcin
George.
On Thu, Mar 21, 2019 at 12:24 PM Joshua Ladd <jladd.m...@gmail.com> wrote:
Marcin,
HPC-X implements the MPI_Bcast operation by leveraging hardware
multicast capabilities. Starting with HPC-X v2.3 we introduced a
new multicast-based algorithm for large messages as well.
Hardware multicast scales as O(1), modulo switch hops; it is the
most efficient way to broadcast a message in an IB network.
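To experiment with this, the multicast path can (if I recall the knob
correctly - please check the HPC-X release notes for the exact
variable name) be toggled for all communicators with:

    mpirun -x HCOLL_ENABLE_MCAST_ALL=1 ./app

Setting it to 0 instead disables the multicast path, which is a quick
way to measure its effect on your workload.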
Hope this helps.
Best,
Josh
On Thu, Mar 21, 2019 at 5:01 AM marcin.krotkiewski
<marcin.krotkiew...@gmail.com> wrote:
Thanks, George! So the function you mentioned is used when I turn
off HCOLL and use Open MPI's tuned coll instead. That helps a lot.
Another thing that makes me wonder: in my case the data is sent to
the targets asynchronously - or rather, it is a 'put' operation in
nature, and the targets don't know when the data is ready. I guess
the tree algorithms you mentioned require active participation of
all nodes, otherwise the algorithm will not progress? Is it enough
to call any MPI routine to assure progression, or do I have to call
the matching Bcast?
Anyone from Mellanox here who knows how HCOLL does this internally,
especially on the EDR architecture? Is there any hardware aid?
Thanks!
Marcin
On 3/20/19 5:10 PM, George Bosilca wrote:
If you have support for FCA then it might happen that the
collective will use the hardware support. In any case, most of the
bcast algorithms have a logarithmic behavior, so there will be at
most O(log(P)) memory accesses on the root.
If you want to take a look at the code in OMPI to understand what
function is called in your specific case, head to
ompi/mca/coll/tuned/ and search for the
ompi_coll_tuned_bcast_intra_dec_fixed function in
coll_tuned_decision_fixed.c.
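As a side note, the fixed decision can also be overridden at run
time through the tuned component's MCA parameters, which is handy
for experimenting with different bcast algorithms. For example
(algorithm 6 should be the binomial tree, but please verify the
numbering on your installation with
ompi_info --param coll tuned --level 9):

    mpirun --mca coll_tuned_use_dynamic_rules 1 \
           --mca coll_tuned_bcast_algorithm 6 ./app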
George.
On Wed, Mar 20, 2019 at 4:53 AM marcin.krotkiewski
<marcin.krotkiew...@gmail.com> wrote:
Hi!
I'm wondering about the details of the Bcast implementation in
Open MPI. I'm specifically interested in IB interconnects, but
information about other architectures (and Open MPI in general)
would also be very useful.
I am working with a code which sends the same (large) message to a
bunch of 'neighboring' processes - somewhat like a ghost-zone
exchange, but with the same message for all neighbors. Since
memory bandwidth is a scarce resource, I'd like to make sure we
send the message with the fewest possible memory accesses.
Hence the question: what does Open MPI (and specifically, in the
IB case, HPC-X) do in such a case? Does it read the buffer from
memory O(1) times to send it to n peers, with the broadcast
orchestrated by the hardware, or does it have to read the memory
O(n) times? Is it more efficient to use Bcast, or is it the same
as implementing the operation with n distinct send/put operations?
Finally, is there any way to use the RMA put method with multiple
targets, so that I only have to read the host memory once and the
switches/HCA take care of the rest?
Thanks a lot for any insights!
Marcin
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel