At init time, each task invokes btl_openib_component_init(), which invokes btl_openib_modex_send(). Basically, it collects the InfiniBand info (port, subnet, lid, ...) and "pushes" it to orted via the modex mechanism.

When communication with a peer is set up, the remote information is retrieved via the modex mechanism in mca_btl_openib_proc_get_locked().
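
To make the idea concrete, here is a toy Python model of that exchange. It is only a sketch and not Open MPI code: the names IbPortInfo, Modex and pick_remote_port are made up for illustration, and the subnet/LID values are invented. The point is that the lookup key is the MPI rank, so there is never an IP-to-LID table: every rank publishes its own port info, and a sender looks up the peer's info by rank.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IbPortInfo:
    """Hypothetical record of what one rank publishes for one IB port."""
    subnet_id: int
    lid: int
    port: int


class Modex:
    """Hypothetical stand-in for the key/value store that every rank feeds."""

    def __init__(self):
        self._store = {}

    def send(self, rank, ports):
        # Done once per rank at init time: publish the local port info.
        self._store[rank] = list(ports)

    def recv(self, rank):
        # Done when a peer is first contacted: fetch what that rank published.
        return self._store[rank]


def pick_remote_port(local, remote_ports):
    """Return the first remote port on the same subnet as the local port."""
    for remote in remote_ports:
        if remote.subnet_id == local.subnet_id:
            return remote
    return None


# Invented values: three ranks, one port each, all on the same IB subnet.
published = {
    0: [IbPortInfo(subnet_id=0xFE80, lid=11, port=1)],
    1: [IbPortInfo(subnet_id=0xFE80, lid=12, port=1)],
    2: [IbPortInfo(subnet_id=0xFE80, lid=13, port=1)],
}

modex = Modex()
for rank, ports in published.items():
    modex.send(rank, ports)

# Rank 0 wants to reach rank 2: it asks for rank 2's info, not for an IP.
target = pick_remote_port(published[0][0], modex.recv(2))
print("rank 0 -> rank 2 uses LID", target.lid)
```

In this model the host file only decides where the ranks get started; which InfiniBand port a message goes to follows from what the target rank itself published.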



On 4/8/2016 1:30 PM, dpchoudh . wrote:
Hi Gilles

Thanks for responding quickly; however, I am afraid I did not explain my question clearly enough; my apologies.

What I am trying to understand is this:

My cluster has (say) 7 nodes. I use IP-over-Ethernet for Orted (for job launch and control traffic); this is not used for MPI messaging. Let's say that the IP addresses are They are all in the same IP subnet.

The MPI messaging is done over some other interconnect, such as InfiniBand. All 7 nodes are connected to the same InfiniBand switch and hence are in the same (InfiniBand) subnet as well.

In my host file, I mention (say) 4 IP addresses: 192.168.3-

My question is: how does Open MPI pick the 4 InfiniBand interfaces that match the IP addresses? Put another way, the ranks of each launched job are (I presume) set up by orted by some mechanism. When I do an MPI_Send() to a given rank, the message goes to the InfiniBand interface with a particular LID. How does this IP-to-InfiniBand LID mapping happen?


On Fri, Apr 8, 2016 at 12:12 AM, Gilles Gouaillardet wrote:


    the hostnames (or their IPs) are only used to ssh orted.

    if you use only the tcp btl :

    TCP *MPI* communications (vs OOB management communications) are
    handled by btl/tcp
    by default, all usable interfaces are used, then messages are
    split (iirc, by ob1 pml) and then "fragments"
    are sent using all interfaces.

    each interface has a latency and a bandwidth that are used to split
    a message into fragments.
    (assuming it is correctly configured, 90% of a large message is
    sent over the 10GbE interface, and 10% is sent over the GbE interface)
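
A rough Python sketch of that proportional split, under simplifying assumptions: it only looks at bandwidth (latency is ignored), the function name split_by_bandwidth and the interface names are made up, and the real scheduling in the ob1 pml is more involved than this.

```python
def split_by_bandwidth(message_len, bandwidths):
    """Split message_len bytes across interfaces in proportion to bandwidth.

    bandwidths maps a (hypothetical) interface name to its bandwidth in Mb/s.
    The last interface receives the rounding remainder so that the sizes
    always add up to message_len.
    """
    total = sum(bandwidths.values())
    names = list(bandwidths)
    sizes = {}
    assigned = 0
    for name in names[:-1]:
        sizes[name] = message_len * bandwidths[name] // total
        assigned += sizes[name]
    sizes[names[-1]] = message_len - assigned
    return sizes


# One 10GbE and one GbE interface: roughly a 90% / 10% split.
shares = split_by_bandwidth(1_100_000, {"eth_10g": 10_000, "eth_1g": 1_000})
for name, size in shares.items():
    print(name, size)
```
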

    you can explicitly whitelist/blacklist interfaces:
    mpirun --mca btl_tcp_if_include ...
    mpirun --mca btl_tcp_if_exclude ...

    (see ompi_info --all for the syntax)

    but if you use several btls (for example tcp and openib), the
    btl(s) with the lower exclusivity are not used.
    (for example, a large message is *not* split and sent using native
    ib, IPoIB and GbE because the openib btl
    has a higher exclusivity than the tcp btl)
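
The exclusivity rule can be pictured with a few lines of Python. This is only an illustration: select_btls is a made-up helper, and the exclusivity numbers are invented and are not claimed to be the real defaults.

```python
def select_btls(candidates):
    """Keep only the BTLs that share the highest exclusivity value.

    candidates maps a BTL name to its exclusivity for one given peer.
    BTLs with a lower exclusivity are dropped for that peer.
    """
    top = max(candidates.values())
    return sorted(name for name, excl in candidates.items() if excl == top)


# Invented values: openib outranks tcp, so tcp is not used for this peer.
reachable = {"openib": 1000, "tcp": 100}
selected = select_btls(reachable)
print(selected)

# With only tcp available, tcp is used.
print(select_btls({"tcp": 100}))
```
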

    did this answer your question ?



    On 4/8/2016 12:24 PM, dpchoudh . wrote:
    Hello all

    (Newbie warning! Sorry :-(  )

    Let's say my cluster has 7 nodes, connected via IP-over-Ethernet
    for control traffic and some kind of raw verbs (or anything else
    such as SRIO) interface for data transfer. Let's say my host file
    chooses 4 out of the 7 nodes for an MPI job, based on the IP
    address, which are assigned to the Ethernet interfaces.

    My question is: where in the code is this mapping between
    IP-to-whatever_interface_is_used_for_MPI_Send/Recv determined,
    such that only those chosen nodes receive traffic over the verbs
    interface?

    Thanks in advance

