Hi,
One more comment regarding IB mcast and its usage. The key advantage of IB 
mcast (enabled with HPC-X + HCOLL when the application uses the MPI_Bcast 
collective) is nearly constant scaling, so it gives the most benefit when many 
nodes participate in a collective at the same time. By default, HCOLL uses the 
mcast-based algorithm when the MPI communicator on which MPI_Bcast is called 
spans at least 8 nodes (regardless of the number of ranks per node). As far as 
I understand your case, you have a neighbor-exchange pattern, i.e. a limited 
number of neighbors (usually 4 in a regular 2D grid, or 6 in 3D) to which you 
need to send the data. If that is the case, your application most likely has 
many small communicators (5 or 7 ranks each). IB mcast is unlikely to give 
much advantage at that scale. Moreover, if you run multiple ranks per node, a 
lot of the communication happens entirely within the node, where the data goes 
through shared memory without touching IB.
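To make the scale concrete, here is a minimal sketch of the kind of small
per-neighborhood communicator such a pattern produces (the 5-rank membership
is a hypothetical example, not your actual decomposition): the root plus its
4 neighbors, which is far below the 8-node threshold mentioned above.

/* Minimal sketch (hypothetical layout): a 5-rank "neighborhood"
 * communicator -- rank 0 plus its 4 assumed neighbors -- of the kind a
 * 2D neighbor-exchange code ends up with. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    MPI_Comm nbr_comm = MPI_COMM_NULL;

    if (size >= 5 && rank < 5) {
        /* Hypothetical neighborhood: rank 0 plus its 4 neighbors. */
        int members[5] = {0, 1, 2, 3, 4};

        MPI_Group world_grp, nbr_grp;
        MPI_Comm_group(MPI_COMM_WORLD, &world_grp);
        MPI_Group_incl(world_grp, 5, members, &nbr_grp);

        /* Only the 5 member ranks take part in the creation. */
        MPI_Comm_create_group(MPI_COMM_WORLD, nbr_grp, 0, &nbr_comm);

        MPI_Group_free(&nbr_grp);
        MPI_Group_free(&world_grp);
    }

    if (nbr_comm != MPI_COMM_NULL) {
        double halo[1024] = {0};   /* same payload for every neighbor */
        /* Bcast over a 5-rank communicator: far below the ~8-node
         * threshold, so HCOLL's mcast path is not selected. */
        MPI_Bcast(halo, 1024, MPI_DOUBLE, 0, nbr_comm);
        MPI_Comm_free(&nbr_comm);
    }

    MPI_Finalize();
    return 0;
}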

Finally, if you want to do RDMA, it is possible. You could use MPI_Put/MPI_Get. 
An alternative is OpenSHMEM, which could be a good fit for this case (it is a 
standard designed around one-sided put/get semantics), but this will require 
some study if you are not familiar with it.
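For reference, here is a minimal MPI one-sided sketch (buffer size and
neighbor choice are placeholders, not taken from your application); an
OpenSHMEM version would use shmem_put on a symmetric-heap buffer instead of
an explicit window.

/* Each rank exposes a receive buffer in a window and MPI_Put()s the same
 * source buffer to a few hypothetical neighbor ranks. */
#include <mpi.h>
#include <stdlib.h>

#define N 1024

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *recv_buf;                      /* exposed to remote puts */
    MPI_Win win;
    MPI_Win_allocate(N * sizeof(double), sizeof(double),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &recv_buf, &win);

    double *src = malloc(N * sizeof(double));   /* local data */
    for (int i = 0; i < N; i++) src[i] = rank;

    /* Hypothetical neighbors: the two adjacent ranks (periodic). */
    int neighbors[2] = { (rank + 1) % size, (rank - 1 + size) % size };

    MPI_Win_fence(0, win);                 /* open the access epoch */
    for (int n = 0; n < 2; n++)
        MPI_Put(src, N, MPI_DOUBLE, neighbors[n], 0, N, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);                 /* close the epoch: puts complete */

    free(src);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}

The fence epochs here synchronize everyone; with neighbor-only
synchronization you would typically use post/start/complete/wait or
passive-target lock/unlock instead.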

From: devel <devel-boun...@lists.open-mpi.org> On Behalf Of George Bosilca
Sent: Thursday, March 21, 2019 7:31 PM
To: Open MPI Developers <devel@lists.open-mpi.org>
Subject: Re: [OMPI devel] Memory performance with Bcast

Marcin,

I am not sure I understand your question. A bcast is a collective operation 
that must be posted by all participants. Regardless of the level at which the 
bcast is serviced, if some of the participants have not posted their 
participation in the collective, only partial progress can be made.
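
To illustrate (a minimal, generic sketch, independent of which algorithm
services the bcast): every rank of the communicator has to post the matching
MPI_Bcast before the operation can complete; the root alone cannot push the
data through.

/* Every rank in the communicator posts the same MPI_Bcast; the call on
 * the root alone is not enough for the collective to complete. */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[1024];
    if (rank == 0)
        for (int i = 0; i < 1024; i++) buf[i] = i;   /* root fills the data */

    /* Matching call on every rank: the root sends, the others receive. */
    MPI_Bcast(buf, 1024, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}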

  George.


On Thu, Mar 21, 2019 at 12:24 PM Joshua Ladd <jladd.m...@gmail.com> wrote:
Marcin,

HPC-X implements the MPI_Bcast operation by leveraging hardware multicast 
capabilities. Starting with HPC-X v2.3 we introduced a new multicast-based 
algorithm for large messages as well. Hardware multicast scales as O(1) modulo 
switch hops. It is the most efficient way to broadcast a message in an IB 
network.

Hope this helps.

Best,

Josh



On Thu, Mar 21, 2019 at 5:01 AM marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:

Thanks, George! So, the function you mentioned is used when I turn off HCOLL 
and use Open MPI's tuned coll component instead. That helps a lot. Another 
thing I am wondering about: in my case the data is sent to the targets 
asynchronously, or rather, it is a 'put'-like operation in nature, and the 
targets don't know when the data is ready. I guess the tree algorithms you 
mentioned require the active participation of all nodes, otherwise the 
algorithm will not progress? Is it enough to call any MPI routine to ensure 
progress, or do I have to call the matching Bcast?

Is anyone from Mellanox here who knows how HCOLL does this internally, 
especially on the EDR architecture? Is there any hardware aid?

Thanks!

Marcin


On 3/20/19 5:10 PM, George Bosilca wrote:
If you have support for FCA, then the collective might use the hardware 
support. In any case, most of the bcast algorithms have logarithmic behavior, 
so there will be at most O(log(P)) memory accesses on the root.
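For intuition, here is a generic binomial-tree broadcast sketch (a textbook
formulation, not the actual tuned implementation in OMPI): the root performs
about log2(P) sends, which is where the O(log(P)) accesses to the root's
buffer come from.

/* Generic binomial-tree broadcast with the root fixed at rank 0.
 * Each rank receives the data once and forwards it to at most log2(P)
 * other ranks; the root reads its buffer O(log P) times. */
#include <mpi.h>

static void binomial_bcast(void *buf, int count, MPI_Datatype dtype,
                           MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* Receive phase: a non-root rank gets the data from rank - mask,
     * where mask is the lowest set bit of its rank. */
    int mask = 1;
    while (mask < size) {
        if (rank & mask) {
            MPI_Recv(buf, count, dtype, rank - mask, 0, comm,
                     MPI_STATUS_IGNORE);
            break;
        }
        mask <<= 1;
    }

    /* Forward phase: pass the data down the tree; the root loops roughly
     * log2(size) times, each iteration reading buf once. */
    mask >>= 1;
    while (mask > 0) {
        if (rank + mask < size)
            MPI_Send(buf, count, dtype, rank + mask, 0, comm);
        mask >>= 1;
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double buf[256];
    if (rank == 0)
        for (int i = 0; i < 256; i++) buf[i] = (double)i;  /* root's data */

    binomial_bcast(buf, 256, MPI_DOUBLE, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}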

If you want to take a look at the code in OMPI to understand which function is 
called in your specific case, head to ompi/mca/coll/tuned/ and search for the 
ompi_coll_tuned_bcast_intra_dec_fixed function in coll_tuned_decision_fixed.c.

  George.


On Wed, Mar 20, 2019 at 4:53 AM marcin.krotkiewski <marcin.krotkiew...@gmail.com> wrote:
Hi!

I'm wondering about the details of the Bcast implementation in Open MPI. I'm 
specifically interested in IB interconnects, but information about other 
architectures (and Open MPI in general) would also be very useful.

I am working with a code that sends the same (large) message to a bunch of 
'neighboring' processes. It is somewhat like a ghost-zone exchange, but the 
message is the same for all neighbors. Since memory bandwidth is a scarce 
resource, I'd like to make sure we send the message with the fewest possible 
memory accesses.

Hence the question: what does Open MPI (and, specifically for the IB case, 
HPC-X) do in such a case? Does it read the buffer from memory O(1) times to 
send it to n peers, with the broadcast orchestrated by the hardware? Or does 
it have to read the memory O(n) times? Is it more efficient to use Bcast, or 
is it the same as implementing the operation with n distinct send/put 
operations? Finally, is there any way to use the RMA put method with multiple 
targets, so that I only have to read the host memory once and the switches/HCA 
take care of the rest?

Thanks a lot for any insights!

Marcin


_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://lists.open-mpi.org/mailman/listinfo/devel
