At init time, each task invokes btl_openib_component_init(), which calls btl_openib_modex_send(). Basically, it collects the infiniband info (port, subnet, lid, ...) and "pushes" it to orted via the modex mechanism.

When a connection to a peer is created, the remote information is retrieved via the modex mechanism in mca_btl_openib_proc_get_locked().
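
To make that flow concrete, here is a minimal, self-contained sketch of
the modex pattern (publish at init, pull on demand). Note the port_info_t
struct and the modex_send()/modex_recv() helpers below are made-up
stand-ins for the real OPAL/OMPI interfaces, so take it as an
illustration of the mechanism rather than the actual API:

/* hypothetical, simplified illustration of the openib modex pattern;
 * only the publish/lookup flow is the point here */
#include <stdint.h>
#include <stdio.h>

/* what each process publishes about one IB port (simplified) */
typedef struct {
    uint64_t subnet_id;   /* IB subnet prefix */
    uint16_t lid;         /* LID assigned by the subnet manager */
    uint8_t  port_num;    /* HCA port number */
} port_info_t;

/* stand-in for the real modex: a key/value store indexed by rank */
static port_info_t modex_store[64];

static void modex_send(int my_rank, const port_info_t *info)
{
    /* the real code packs this and pushes it to the runtime */
    modex_store[my_rank] = *info;
}

static void modex_recv(int peer_rank, port_info_t *info)
{
    /* the real code pulls the peer's blob from the runtime on demand */
    *info = modex_store[peer_rank];
}

int main(void)
{
    /* "init time": rank 0 publishes its IB port info */
    port_info_t mine = { 0xfe80000000000000ULL, 7, 1 };
    modex_send(0, &mine);

    /* "connection time": a peer looks up rank 0's info and now knows
     * which subnet/LID to address, independently of any IP address */
    port_info_t peer;
    modex_recv(0, &peer);
    printf("peer lid=%u on subnet 0x%llx\n", (unsigned)peer.lid,
           (unsigned long long)peer.subnet_id);
    return 0;
}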

Cheers,

Gilles

On 4/8/2016 1:30 PM, dpchoudh . wrote:
Hi Gilles

Thanks for responding quickly; however, I am afraid I did not explain my question clearly enough; my apologies.

What I am trying to understand is this:

My cluster has (say) 7 nodes. I use IP-over-Ethernet for Orted (for job launch and control traffic); this is not used for MPI messaging. Let's say that the IP addresses are 192.168.1.2-192.168.1.9. They are all in the same IP subnet.

The MPI messaging uses some other interconnect, such as Infiniband. All 7 nodes are connected to the same Infiniband switch and hence are in the same (Infiniband) subnet as well.

In my host file, I mention (say) 4 IP addresses: 192.168.1.3-192.168.1.7

My question is, how does Open MPI pick the 4 Infiniband interfaces that match those IP addresses? Put another way, the ranks of each launched job are (I presume) set up by orted through some mechanism. When I do an MPI_Send() to a given rank, the message goes to the Infiniband interface with a particular LID. How does this IP-to-Infiniband-LID mapping happen?

Thanks
Durga

We learn from history that we never learn from history.

On Fri, Apr 8, 2016 at 12:12 AM, Gilles Gouaillardet <gil...@rist.or.jp> wrote:

    Hi,

    the hostnames (or their IPs) are only used to ssh orted.


    if you use only the tcp btl:

    TCP *MPI* communications (vs OOB management communications) are
    handled by btl/tcp.
    by default, all usable interfaces are used: messages are split
    (iirc, by the ob1 pml) into "fragments", which are then sent over
    all the interfaces.

    each interface has a latency and a bandwidth that are used to split
    a message into fragments.
    (assuming it is correctly configured, 90% of a large message is
    sent over the 10GbE interface, and 10% is sent over the GbE interface)
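
    as a rough illustration of that proportional split (the numbers are
    just the 10GbE/GbE example above; the real scheduling also takes
    latency and per-btl limits into account, so this is a simplification):

    /* toy example: split a message across interfaces in proportion to
     * their advertised bandwidth */
    #include <stdio.h>
    #include <stddef.h>

    int main(void)
    {
        const char  *iface[]     = { "10GbE", "GbE" };
        const double bandwidth[] = { 10000.0, 1000.0 };  /* Mbps */
        const size_t msg_len     = 10 * 1024 * 1024;     /* 10 MB message */

        double total_bw = 0.0;
        for (int i = 0; i < 2; i++)
            total_bw += bandwidth[i];

        for (int i = 0; i < 2; i++) {
            size_t share = (size_t)(msg_len * bandwidth[i] / total_bw);
            /* prints roughly the 90%/10% split mentioned above */
            printf("%-6s carries ~%zu bytes (%.0f%%)\n",
                   iface[i], share, 100.0 * bandwidth[i] / total_bw);
        }
        return 0;
    }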

    you can explicitly include/exclude interfaces with
    mpirun --mca btl_tcp_if_include ...
    or
    mpirun --mca btl_tcp_if_exclude ...

    (see ompi_info --all for the syntax)


    but if you use several btls (for example tcp and openib), the
    btl(s) with the lower exclusivity are not used.
    (for example, a large message is *not* split and sent using native
    ib, IPoIB and GbE, because the openib btl
    has a higher exclusivity than the tcp btl)
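
    here is a very small sketch of that exclusivity rule, just to show
    the selection logic (names and numbers are made up; in the real code
    the selection happens when the btls are added for a peer):

    /* toy sketch: among the btls that can reach a peer, keep only the
     * ones with the highest exclusivity (e.g. openib wins over tcp) */
    #include <stdio.h>
    #include <stdint.h>

    typedef struct {
        const char *name;
        uint32_t    exclusivity;  /* higher wins */
        int         reachable;    /* can this btl reach the peer? */
    } btl_t;

    int main(void)
    {
        btl_t btls[] = {
            { "openib", 1024, 1 },
            { "tcp",     100, 1 },  /* IPoIB and GbE both go through tcp */
        };
        const int n = (int)(sizeof(btls) / sizeof(btls[0]));

        /* find the highest exclusivity among the reachable btls */
        uint32_t best = 0;
        for (int i = 0; i < n; i++)
            if (btls[i].reachable && btls[i].exclusivity > best)
                best = btls[i].exclusivity;

        /* only btls at that level are used for this peer */
        for (int i = 0; i < n; i++)
            if (btls[i].reachable && btls[i].exclusivity == best)
                printf("using %s\n", btls[i].name);
        return 0;
    }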


    did this answer your question?

    Cheers,

    Gilles



    On 4/8/2016 12:24 PM, dpchoudh . wrote:
    Hello all

    (Newbie warning! Sorry :-(  )

    Let's say my cluster has 7 nodes, connected via IP-over-Ethernet
    for control traffic and some kind of raw verbs (or anything else
    such as SRIO) interface for data transfer. Let's say my host file
    chooses 4 out of the 7 nodes for an MPI job, based on the IP
    addresses, which are assigned to the Ethernet interfaces.

    My question is: where in the code is this mapping from IP to
    whatever interface is used for MPI_Send/Recv determined, such that
    only those chosen nodes receive traffic over the verbs interface?

    Thanks in advance
    Durga

    We learn from history that we never learn from history.

