To be clear: what Gilles initially said is correct -- the hostnames/IP 
addresses that you give to mpirun are used only to specify the machines on 
which to launch processes.  They give no indication of which networks to use 
for MPI communication.

In general, the BTLs and MTLs are queried during MPI_INIT and figure out which 
networks are available.  The next steps have changed over the years and across 
different versions of Open MPI, but notionally it's more or less like this:

- They all "publish" this info to a (notionally) global data store.
- Each process X checks whether it can communicate with peer process Y: it 
retrieves Y's BTL/MTL info from the global data store.
- X effectively gives this addressing info to its BTLs/MTLs and says "can you 
connect to any of these?"
  --> MCA params such as btl_NAME_if_include and btl_NAME_if_exclude are 
factored in here.
- X builds up a matrix/list of BTLs/MTLs that it can use to communicate with 
each peer process Y.
- If X finds a peer process with which it cannot communicate, it errors out.

Again, the details of the above procedure vary from version to version, but 
that's the general idea: each process investigates its local networking setup 
and compares it to its peers' local networking setups to look for 
connectivity paths.
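
If it helps to make that concrete, here's a rough C sketch of the 
reachability loop.  To be clear, this is NOT the actual Open MPI code -- every 
name in it (the modex array, btl_can_reach(), etc.) is invented for 
illustration:

    /* Hypothetical sketch of the MPI_INIT-time reachability check
       described above.  Every name here is invented; none of this is
       the real Open MPI implementation. */
    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    #define NPROCS 4
    #define NBTLS  2

    /* Addressing info that each BTL "publishes" at init time. */
    struct btl_addr {
        char network[16];   /* e.g., "tcp" or "openib" */
        char address[32];   /* network-specific endpoint address */
    };

    /* Stand-in for the (notionally) global data store. */
    static struct btl_addr modex[NPROCS][NBTLS];

    /* "Can you connect to this address?"  Here we just match network
       types; the real BTLs also factor in if_include/if_exclude. */
    static bool btl_can_reach(const struct btl_addr *mine,
                              const struct btl_addr *theirs)
    {
        return strcmp(mine->network, theirs->network) == 0;
    }

    int main(void)
    {
        /* Step 1: every process "publishes" its BTL addressing info. */
        for (int p = 0; p < NPROCS; ++p) {
            snprintf(modex[p][0].network, 16, "tcp");
            snprintf(modex[p][0].address, 32, "192.168.1.%d", p + 2);
            snprintf(modex[p][1].network, 16, "openib");
            snprintf(modex[p][1].address, 32, "lid-%d", p + 1);
        }

        int me = 0;  /* pretend we are process X = rank 0 */

        /* Step 2: for each peer Y, ask each BTL if it can connect. */
        for (int y = 0; y < NPROCS; ++y) {
            bool reachable = false;
            for (int b = 0; b < NBTLS; ++b) {
                if (btl_can_reach(&modex[me][b], &modex[y][b])) {
                    printf("proc %d -> proc %d via %s (%s)\n", me, y,
                           modex[y][b].network, modex[y][b].address);
                    reachable = true;
                }
            }
            /* Step 3: a peer with no usable path is a fatal error. */
            if (!reachable) {
                fprintf(stderr, "proc %d cannot reach proc %d\n", me, y);
                return 1;
            }
        }
        return 0;
    }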



> On Apr 8, 2016, at 12:54 AM, dpchoudh . <dpcho...@gmail.com> wrote:
> 
> Thank you very much, Gilles. That is exactly the information I was looking 
> for.
> 
> Best regards
> Durga
> 
> We learn from history that we never learn from history.
> 
> On Fri, Apr 8, 2016 at 12:52 AM, Gilles Gouaillardet <gil...@rist.or.jp> 
> wrote:
> At init time, each task invokes btl_openib_component_init(), which invokes 
> btl_openib_modex_send().
> Basically, it collects InfiniBand info (port, subnet, LID, ...) and "pushes" 
> it to orted via the modex mechanism.
> 
> When a connection is established, the remote information is retrieved via the 
> modex mechanism in mca_btl_openib_proc_get_locket().
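> 
> For illustration, here is a toy sketch of that send/receive flow.  The names 
> below (modex_put(), modex_get(), struct port_info) are invented for this 
> example; they are not the real btl/openib symbols:
> 
>     /* Toy modex exchange; every name is invented for illustration. */
>     #include <stdint.h>
>     #include <stdio.h>
> 
>     struct port_info {
>         uint64_t subnet_id;  /* IB subnet prefix */
>         uint16_t lid;        /* IB local identifier */
>         uint8_t  port_num;   /* HCA port number */
>     };
> 
>     /* Stand-in for the runtime's key/value store. */
>     static struct port_info store[2];
> 
>     /* Init time: "push" local IB info (orted distributes it). */
>     static void modex_put(int rank, const struct port_info *pi) {
>         store[rank] = *pi;
>     }
> 
>     /* Connection time: retrieve the peer's published info. */
>     static void modex_get(int rank, struct port_info *pi) {
>         *pi = store[rank];
>     }
> 
>     int main(void) {
>         struct port_info mine = { .subnet_id = 0xfe80000000000000ULL,
>                                   .lid = 7, .port_num = 1 };
>         modex_put(1, &mine);          /* rank 1 publishes at init */
> 
>         struct port_info peer;
>         modex_get(1, &peer);          /* rank 0 looks it up later */
>         printf("peer lid=%u port=%u\n", (unsigned)peer.lid,
>                (unsigned)peer.port_num);
>         return 0;
>     }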
> 
> Cheers,
> 
> Gilles
> 
> 
> On 4/8/2016 1:30 PM, dpchoudh . wrote:
>> Hi Gilles
>> 
>> Thanks for responding quickly; however, I am afraid I did not explain my 
>> question clearly enough; my apologies.
>> 
>> What I am trying to understand is this:
>> 
>> My cluster has (say) 7 nodes. I use IP-over-Ethernet for orted (for job 
>> launch and control traffic); this is not used for MPI messaging. Let's say 
>> that the IP addresses are 192.168.1.2-192.168.1.8. They are all in the same 
>> IP subnet.
>> 
>> MPI messaging goes over some other interconnect, such as InfiniBand. All 7 
>> nodes are connected to the same InfiniBand switch and hence are in the same 
>> (InfiniBand) subnet as well.
>> 
>> In my host file, I mention (say) 4 IP addresses: 192.168.1.3-192.168.1.6.
>> 
>> My question is, how does Open MPI pick the 4 InfiniBand interfaces that 
>> match the IP addresses? Put another way, the rank of each launched process 
>> is (I presume) set up by orted via some mechanism. When I do an MPI_Send() 
>> to a given rank, the message goes to the InfiniBand interface with a 
>> particular LID. How does this IP-to-InfiniBand-LID mapping happen?
>> 
>> Thanks
>> Durga
>> 
>> We learn from history that we never learn from history.
>> 
>> On Fri, Apr 8, 2016 at 12:12 AM, Gilles Gouaillardet <gil...@rist.or.jp> 
>> wrote:
>> Hi,
>> 
>> The hostnames (or their IPs) are used only to ssh orted.
>> 
>> 
>> If you use only the tcp btl:
>> 
>> TCP *MPI* communications (vs. OOB management communications) are handled by 
>> btl/tcp. By default, all usable interfaces are used; messages are split 
>> (IIRC, by the ob1 pml) and the "fragments" are sent over all interfaces.
>> 
>> Each interface has a latency and a bandwidth, which are used to decide how 
>> to split a message into fragments.
>> (Assuming it is correctly configured, ~90% of a large message is sent over 
>> the 10GbE interface, and ~10% is sent over the GbE interface.)
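>> 
>> As a back-of-the-envelope check of that split (a toy calculation, not the 
>> actual ob1/tcp code):
>> 
>>     /* Toy proportional split: each interface carries a share of the
>>        message proportional to its bandwidth.  Not real Open MPI code. */
>>     #include <stdio.h>
>> 
>>     int main(void) {
>>         double bw[]        = { 10000.0, 1000.0 };   /* Mbps: 10GbE, GbE */
>>         const char *name[] = { "10GbE", "GbE" };
>>         double msg   = 100.0 * 1024 * 1024;         /* 100 MiB message */
>>         double total = bw[0] + bw[1];
>> 
>>         for (int i = 0; i < 2; ++i)
>>             printf("%5s carries ~%4.1f%% (%.0f bytes)\n", name[i],
>>                    100.0 * bw[i] / total, msg * bw[i] / total);
>>         return 0;
>>     }
>> 
>> (10000 / 11000 is about 91%, which matches the "90% / 10%" rule of thumb 
>> above.)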
>> 
>> You can explicitly include/exclude interfaces:
>> mpirun --mca btl_tcp_if_include ...
>> or
>> mpirun --mca btl_tcp_if_exclude ...
>> 
>> (see ompi_info --all for the syntax)
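>> 
>> For example, both interface names and CIDR notation are accepted:
>> 
>>     mpirun --mca btl_tcp_if_include eth0,eth1 ...
>>     mpirun --mca btl_tcp_if_include 192.168.1.0/24 ...
>>     mpirun --mca btl_tcp_if_exclude lo,virbr0 ...
>> 
>> (Note that if_include and if_exclude are mutually exclusive; set at most 
>> one of them.)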
>> 
>> 
>> But if you use several btls (for example tcp and openib), the btl(s) with 
>> the lower exclusivity are not used.
>> (For example, a large message is *not* split and sent using native IB, 
>> IPoIB, and GbE, because the openib btl has a higher exclusivity than the 
>> tcp btl.)
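>> 
>> Roughly, the selection looks like the toy filter below; this is a sketch 
>> with invented names and made-up exclusivity values, not the real selection 
>> code:
>> 
>>     /* Toy exclusivity filter: among the BTLs that can reach a peer,
>>        keep only those with the highest exclusivity. */
>>     #include <stdio.h>
>> 
>>     struct btl { const char *name; int exclusivity; int reaches_peer; };
>> 
>>     int main(void) {
>>         struct btl btls[] = {
>>             { "openib", 1024, 1 },   /* values are made up */
>>             { "tcp",     100, 1 },
>>         };
>>         int best = -1;
>>         for (int i = 0; i < 2; ++i)
>>             if (btls[i].reaches_peer && btls[i].exclusivity > best)
>>                 best = btls[i].exclusivity;
>>         for (int i = 0; i < 2; ++i)
>>             if (btls[i].reaches_peer && btls[i].exclusivity == best)
>>                 printf("using %s\n", btls[i].name);  /* only openib */
>>         return 0;
>>     }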
>> 
>> 
>> Did this answer your question?
>> 
>> Cheers,
>> 
>> Gilles
>> 
>> 
>> 
>> On 4/8/2016 12:24 PM, dpchoudh . wrote:
>>> Hello all
>>> 
>>> (Newbie warning! Sorry :-(  )
>>> 
>>> Let's say my cluster has 7 nodes, connected via IP-over-Ethernet for 
>>> control traffic and some kind of raw verbs (or anything else, such as 
>>> SRIO) interface for data transfer. Let's say my host file chooses 4 out of 
>>> the 7 nodes for an MPI job, based on the IP addresses that are assigned to 
>>> the Ethernet interfaces.
>>> 
>>> My question is: where in the code is the mapping from IP to 
>>> whatever-interface-is-used-for-MPI_Send/Recv determined, such that only 
>>> the chosen nodes receive traffic over the verbs interface?
>>> 
>>> Thanks in advance
>>> Durga
>>> 
>>> We learn from history that we never learn from history.


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
