At init time, each task invokes btl_openib_component_init(), which in turn
calls btl_openib_modex_send().
Basically, it collects the InfiniBand info (port, subnet, LID, ...) and
"pushes" it to orted via the modex mechanism.
When a connection is established, the remote information is retrieved via
the modex mechanism in mca_btl_openib_proc_get_locked().
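To make the flow concrete, here is a tiny standalone C sketch of that
publish/lookup pattern. This is *not* the actual Open MPI code: the struct
fields, the in-memory table standing in for the modex, and the ranks are
illustrative assumptions only.

    /* Sketch of the modex publish/lookup pattern used by the openib btl.
     * In Open MPI itself, the data is pushed with btl_openib_modex_send()
     * at init time and pulled back in mca_btl_openib_proc_get_locked(). */
    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        uint64_t subnet_id;  /* IB subnet prefix */
        uint16_t lid;        /* local identifier of the port */
        uint8_t  port_num;   /* HCA port number */
    } ib_port_info_t;

    /* toy key/value store standing in for the real modex */
    #define MAX_PROCS 8
    static struct { int rank; ib_port_info_t info; } modex_db[MAX_PROCS];
    static int modex_db_len = 0;

    /* "push" side: each task publishes its IB info at init time */
    static void modex_send(int my_rank, const ib_port_info_t *info)
    {
        modex_db[modex_db_len].rank = my_rank;
        modex_db[modex_db_len].info = *info;
        modex_db_len++;
    }

    /* "pull" side: retrieve a remote task's IB info when connecting */
    static int modex_recv(int peer_rank, ib_port_info_t *out)
    {
        for (int i = 0; i < modex_db_len; i++) {
            if (modex_db[i].rank == peer_rank) {
                *out = modex_db[i].info;
                return 0;
            }
        }
        return -1; /* peer unknown */
    }

    int main(void)
    {
        /* ranks 0 and 1 each publish their port info at init */
        ib_port_info_t r0 = { 0xfe80000000000000ULL, 0x11, 1 };
        ib_port_info_t r1 = { 0xfe80000000000000ULL, 0x12, 1 };
        modex_send(0, &r0);
        modex_send(1, &r1);

        /* later, rank 0 wants to talk to rank 1: it never uses an IP,
         * it asks the modex for rank 1's LID and connects to that */
        ib_port_info_t peer;
        if (modex_recv(1, &peer) == 0)
            printf("rank 1 is LID 0x%x on port %u\n", peer.lid, peer.port_num);
        return 0;
    }

Note that no IP address appears on the data path: the mapping is
rank -> modex entry -> LID, which is exactly why the host file IPs only
matter for launching the job.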
Cheers,
Gilles
On 4/8/2016 1:30 PM, dpchoudh . wrote:
Hi Gilles
Thanks for responding quickly; however, I am afraid I did not explain
my question clearly enough; my apologies.
What I am trying to understand is this:
My cluster has (say) 7 nodes. I use IP-over-Ethernet for orted (for
job launch and control traffic); this is not used for MPI messaging.
Let's say that the IP addresses are 192.168.1.2-192.168.1.8. They are
all in the same IP subnet.
The MPI messaging is done over some other interconnect, such as
InfiniBand. All 7 nodes are connected to the same InfiniBand switch
and hence are in the same (InfiniBand) subnet as well.
In my host file, I mention (say) 4 IP addresses: 192.168.1.3-192.168.1.6.
My question is: how does Open MPI pick the 4 InfiniBand interfaces that
match those IP addresses? Put another way, the ranks of the launched
processes are (I presume) set up by orted through some mechanism. When I
do an MPI_Send() to a given rank, the message goes to the InfiniBand
interface with a particular LID. How does this IP-to-InfiniBand-LID
mapping happen?
Thanks
Durga
We learn from history that we never learn from history.
On Fri, Apr 8, 2016 at 12:12 AM, Gilles Gouaillardet
<gil...@rist.or.jp> wrote:
Hi,
The hostnames (or their IPs) are only used to ssh to the nodes and
launch orted.
If you use only the tcp btl:
TCP *MPI* communications (as opposed to OOB management communications)
are handled by btl/tcp.
By default, all usable interfaces are used: messages are split (iirc,
by the ob1 pml) into "fragments", and the fragments are sent over all
the interfaces.
Each interface has a latency and a bandwidth, which are used to decide
how to split a message into fragments.
(Assuming it is correctly configured, 90% of a large message is sent
over the 10GbE interface, and 10% is sent over the GbE interface; see
the sketch below.)
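As a back-of-the-envelope illustration of that split, here is a standalone
C sketch (not Open MPI code; the interface names and bandwidth values are
made up and would normally come from each btl's configuration):

    /* Split a message across interfaces in proportion to bandwidth. */
    #include <stdio.h>
    #include <stddef.h>

    int main(void)
    {
        const char  *ifname[]    = { "10GbE", "GbE" };
        const double bandwidth[] = { 10000.0, 1000.0 }; /* Mbit/s */
        const size_t msg_len     = 1 << 20;             /* 1 MiB message */

        double total_bw = 0.0;
        for (int i = 0; i < 2; i++)
            total_bw += bandwidth[i];

        /* each interface carries a share proportional to its bandwidth:
         * ~90% on 10GbE and ~10% on GbE in this configuration */
        for (int i = 0; i < 2; i++) {
            size_t frag = (size_t)(msg_len * (bandwidth[i] / total_bw));
            printf("%-5s carries %zu bytes (%.0f%%)\n",
                   ifname[i], frag, 100.0 * bandwidth[i] / total_bw);
        }
        return 0;
    }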
You can explicitly include/exclude interfaces:
mpirun --mca btl_tcp_if_include ...
or
mpirun --mca btl_tcp_if_exclude ...
(see ompi_info --all for the syntax)
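For instance (the interface names here are made up), something like
    mpirun --mca btl_tcp_if_include eth0,eth2 ./a.out
or, with CIDR notation,
    mpirun --mca btl_tcp_if_include 192.168.1.0/24 ./a.out
restricts the tcp btl to those interfaces. iirc, the two parameters are
mutually exclusive: set one or the other, not both.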
But if you use several btls (for example tcp and openib), the btl(s)
with the lower exclusivity are not used.
(For example, a large message is *not* split and sent over native IB,
IPoIB, and GbE at the same time, because the openib btl has a higher
exclusivity than the tcp btl; see the sketch below.)
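The selection rule can be illustrated with a standalone sketch (again, not
the actual Open MPI selection code; the exclusivity values below are
made-up stand-ins for the btls' defaults):

    /* Keep only the btl(s) sharing the highest exclusivity value;
     * lower-exclusivity btls are dropped for that peer. */
    #include <stdio.h>

    typedef struct { const char *name; int exclusivity; } btl_t;

    int main(void)
    {
        /* illustrative values: openib advertises a higher exclusivity */
        btl_t btls[] = { { "openib", 1024 }, { "tcp", 100 } };
        const int n  = 2;

        int max_excl = 0;
        for (int i = 0; i < n; i++)
            if (btls[i].exclusivity > max_excl)
                max_excl = btls[i].exclusivity;

        /* tcp (and hence IPoIB/GbE striping) is discarded here
         * because openib is usable and wins on exclusivity */
        for (int i = 0; i < n; i++)
            if (btls[i].exclusivity == max_excl)
                printf("using btl %s\n", btls[i].name);
        return 0;
    }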
Did this answer your question?
Cheers,
Gilles
On 4/8/2016 12:24 PM, dpchoudh . wrote:
Hello all
(Newbie warning! Sorry :-( )
Let's say my cluster has 7 nodes, connected via IP-over-Ethernet
for control traffic and some kind of raw verbs interface (or anything
else, such as SRIO) for data transfer. Let's say my host file
chooses 4 out of the 7 nodes for an MPI job, based on the IP
addresses, which are assigned to the Ethernet interfaces.
My question is: where in the code is the mapping from an IP address
to whatever interface is used for MPI_Send()/MPI_Recv() determined,
such that only the chosen nodes receive traffic over the verbs
interface?
Thanks in advance
Durga
We learn from history that we never learn from history.