On Apr 3, 2007, at 4:57 PM, po...@cc.gatech.edu wrote:

I need to find when the underlying network is free. That means I don't need to go into the details of how MPI_Send is implemented.

Ah, ok.  That explains a lot.

What I want to know is when the MPI_Send is started, or rather, when MPI does not use the underlying network.

I need to find timing for
1) When the application issues the send command

This (and #5) can be implemented with a PMPI-based intercept library (I assume that by "command", you mean "API function call").
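
For example, a minimal sketch of such an intercept for MPI_Send might look like the following (the printf is just a placeholder for whatever logging/timing you want to do; note that MPI-3 and later declare buf as "const void *"):

    /* Compile this into its own library and link it ahead of the real MPI
     * library.  The application's call to MPI_Send() lands here; we record
     * timestamps and then call PMPI_Send(), the "real" implementation. */
    #include <stdio.h>
    #include <mpi.h>

    int MPI_Send(void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm)
    {
        double t_enter = MPI_Wtime();   /* when the application issued the send */
        int rc = PMPI_Send(buf, count, datatype, dest, tag, comm);
        double t_return = MPI_Wtime();  /* when MPI_Send() returned */

        /* "dest" answers #5 (who the receiver is), within this communicator */
        printf("MPI_Send to rank %d: entered %.9f, returned %.9f\n",
               dest, t_enter, t_return);
        return rc;
    }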

2) When MPI actually issues the send command
3) When the BTL performs the actual transfer (send)

What are you looking to distinguish here? I.e., what is the difference between 1 and 2 vs. 3?

Open MPI has an MPI_Send() function in C that does some error checking and then invokes an underlying "send" function (via function pointer) in a plugin that starts doing the setup for the MPI semantics of the send. Eventually, another function pointer is used to invoke the "send" function in the BTL to actually send the message. More setup is performed down in the BTL (usually dealing with setting up data structures to invoke the underlying network/OS/driver "send" function that starts the network send), and then we invoke some underlying OS/kernel-bypass function to start the network transfer. Note that all we can guarantee is that the transfer starts sometime after that -- there's no way to know *exactly* when it starts because the underlying kernel driver may choose to defer it for a while based on flow control, available resources, etc.

Specifically, similar to one of my prior e-mails, the calling structure is something like this:

MPI_Send()
  --> PML plugin (usually the "ob1" plugin)
    --> BTL plugin (one of the components in the ompi/mca/btl/ directory)
      --> underlying OS/kernel-bypass function
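
Very roughly, and with hypothetical, simplified names (the real interfaces live in ompi/mca/pml/pml.h and ompi/mca/btl/btl.h and carry far more state), the function-pointer dispatch looks like this:

    /* Illustrative sketch only -- these struct and function names are made up
     * to show the layering; they are not the real Open MPI symbols. */
    #include <stddef.h>

    struct btl_module {
        /* eventually calls the OS / kernel-bypass "send" for the network */
        int (*btl_send)(void *msg, size_t len);
    };

    struct pml_module {
        struct btl_module *btl;
        /* sets up MPI matching/semantics, then calls btl->btl_send(...) */
        int (*pml_send)(void *buf, size_t len, int dest);
    };

    int mpi_send_sketch(struct pml_module *pml, void *buf, size_t len, int dest)
    {
        /* MPI-level argument/error checking elided */
        return pml->pml_send(buf, len, dest);
    }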

4) When does the send complete

By "complete", what exactly are you looking for? There's several definitions possible here:

- when any of the "send" functions listed above returns
- when the underlying network driver tells us that it is complete (a.k.a. "local completion" -- it *DOES NOT* imply that the receiver has even started to receive the message, nor that the message has even left the host yet)
- when the receiver ACKs receipt of the message
- when MPI_Send() returns

FWIW, we usually measure local completion time because that's all that we can know (because the underlying network driver makes its own decisions about when messages are put out on the network, etc., and we [i.e., any user-level networking software] don't have visibility of that information).

5) Who the receiver was.
etc. This was an example of MPI_Send.
Similarly, I need to know this for MPI_Isend, broadcast, etc.

I guess this can be done using PMPI.

Some of this can, yes.

But PMPI can do it during profiling stages, while I want all this data at run time.

I don't quite understand this statement -- PMPI is a run-time profiling system. All it does is insert your shim PMPI layer between the user's application and the "real" MPI layer.

So that I can improve the performance of the system by using that idle time.

What I'm piecing together from your e-mails is that you want to use MPI in conjunction with using the network directly, either through the BTLs or some other communication library (i.e., both MPI and your other communication library will share the same network resources), and you're trying to find out when MPI is not using a particular network resource so that you can use it with your other communication library in order to maximize utilization and minimize contention / congestion.

Is that right?

Well, the I/O used is Lustre (it's ROMIO).

Note that ROMIO uses a fair bit of MPI sending and receiving before using the underlying file system. So you'll have at least 2 layers of software to explore to find out when the network is free/busy.

What I mean by an I/O node is a node that does input and output processing, i.e., it writes to Lustre; a compute node just transfers data to an I/O node to write it to Lustre. A compute node does not have memory at all, so whenever it has something to write, the data gets transferred to an I/O node, and then the I/O node does the read and write.

Ok. I'm guessing/assuming that this is multiplexing that is either done in ROMIO or in Lustre itself.

So when MPI_Send is not issued, the network (InfiniBand interconnect) can be used for some other transfer.

Makes sense.

Can anyone help me with how to go about tracing this at run time?

The BTL plugin that you will be concerned with is the "openib" BTL (in the Open MPI source tree: ompi/mca/btl/openib/*) -- assuming that you are using an OpenFabrics/OFED-based network driver on your nodes (if you're using an older mvapi-based network driver, you'll use the mvapi BTL: ompi/mca/btl/mvapi/* -- but I would not recommend this because all current and future effort for InfiniBand in OMPI is being done with OFED/the openib BTL).

Be warned: IB networks are highly flexible and therefore the API for it is fairly complex. The native API for OFED-based IB verbs is in a library called "libibverbs" -- man pages for the particular verbs function calls will be included in the next OFED release (OFED v1.2), so you probably don't have them loaded on your cluster. I've attached a tarball of the man pages for you. You'll need these man pages to understand what the openib BTL is doing.

If what you want to do is figure out when OMPI's openib BTL is not using the network, you need to a) understand the BTL interface (and to some extent, how the "ob1" PML uses it), b) understand at least generally how the InfiniBand verbs API (functions that begin with ibv_*()) work, and c) understand how the openib BTL works.

To that end, I'd suggest the following:

a) Understand the BTL interface: see ompi/mca/btl/btl.h for the BTL plugin interface and at least some comments about how it is used. Also see the slides from the OMPI Developer's Workshop (especially Wednesday, the day where point-to-point communications were covered):

http://www.open-mpi.org/papers/workshop-2006/

b) Read up on the IBV function call man pages. Understand a few major concepts before starting:

- Using the IB network requires the use of "registered" memory, meaning that any messages sent or received across the IB network must use this special "registered" memory. OMPI dedicates a *lot* of code to managing registered memory (you'll see references to the rdma mpool [memory pool] component in the OMPI trunk -- it's slightly different on the v1.2 series branch).
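
For reference, a registration with libibverbs looks roughly like this (a sketch only; it assumes you already have a protection domain "pd" from ibv_alloc_pd()):

    #include <stdlib.h>
    #include <infiniband/verbs.h>

    /* Register (pin) a buffer so the HCA can DMA to/from it.  Registration is
     * synchronous and relatively expensive, which is why OMPI caches
     * registrations in its mpool components. */
    struct ibv_mr *register_buffer(struct ibv_pd *pd, size_t len)
    {
        void *buf = malloc(len);
        if (buf == NULL) return NULL;

        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_WRITE);
        return mr;   /* mr->lkey / mr->rkey are later used in work requests */
    }

    /* ...much later: ibv_dereg_mr(mr); free(buf); */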

- Most IB actions are asynchronous. So to send a message, you simply create a work queue entry (WQE) and put it on a work queue (WQ). The IB NIC takes over from there and progresses the send. When the send has completed (local completion only; it does not imply that the receiver has even started to receive), an entry will appear on the completion queue (CQ) telling OMPI that it is done and the message buffer can now be re-used/deallocated/whatever.
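
A bare-bones version of that sequence with the verbs API might look like this (a sketch only; it assumes "qp" is an already-connected queue pair, "cq" is its send completion queue, and "buf"/"len" lie inside memory registered as "mr"):

    #include <string.h>
    #include <stdint.h>
    #include <infiniband/verbs.h>

    int send_and_wait(struct ibv_qp *qp, struct ibv_cq *cq,
                      struct ibv_mr *mr, void *buf, uint32_t len)
    {
        struct ibv_sge sge = { .addr = (uintptr_t) buf,
                               .length = len, .lkey = mr->lkey };
        struct ibv_send_wr wr, *bad_wr = NULL;
        memset(&wr, 0, sizeof(wr));
        wr.wr_id      = 1;
        wr.sg_list    = &sge;
        wr.num_sge    = 1;
        wr.opcode     = IBV_WR_SEND;
        wr.send_flags = IBV_SEND_SIGNALED;     /* ask for a CQ entry when done */

        if (ibv_post_send(qp, &wr, &bad_wr) != 0)   /* hand the WQE to the NIC */
            return -1;

        struct ibv_wc wc;
        while (ibv_poll_cq(cq, 1, &wc) == 0)   /* busy-poll for local completion */
            ;                                  /* (does NOT mean the peer got it) */
        return (wc.status == IBV_WC_SUCCESS) ? 0 : -1;
    }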

- Note that registering and de-registering memory is synchronous and can be fairly expensive (i.e., slow).

- All IB communication is done through queue pairs (QPs): a send queue and a receive queue. QPs are analogous to sockets -- you open a QP between two processes. You then send to that peer by creating a WQE for the send queue and putting it on the WQ. The NIC will then progress the send buffer and, when local completion occurs, will put an entry on the CQ.
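
Creating a QP looks roughly like this (a sketch only; it assumes "pd" and "cq" were created earlier with ibv_alloc_pd()/ibv_create_cq(), and it omits the ibv_modify_qp() state transitions and the out-of-band exchange of QP numbers needed to actually connect to the peer):

    #include <string.h>
    #include <infiniband/verbs.h>

    struct ibv_qp *create_qp(struct ibv_pd *pd, struct ibv_cq *cq)
    {
        struct ibv_qp_init_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.send_cq = cq;             /* completions for the send queue */
        attr.recv_cq = cq;             /* completions for the receive queue */
        attr.qp_type = IBV_QPT_RC;     /* reliable connected: one per peer, like a socket */
        attr.cap.max_send_wr  = 128;   /* how many WQEs may be outstanding */
        attr.cap.max_recv_wr  = 128;
        attr.cap.max_send_sge = 1;
        attr.cap.max_recv_sge = 1;

        return ibv_create_qp(pd, &attr);
    }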

- There is no such thing as an unexpected message in IB -- you *must* pre-post buffers to receive messages. Hence, OMPI posts a bunch of buffers during init to receive messages via the receive queues in the QPs. These buffers are used when you use "send / receive" semantics for IB message passing.
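
Pre-posting a single receive buffer looks roughly like this (a sketch only; again it assumes a QP "qp" and memory registered as "mr"):

    #include <string.h>
    #include <stdint.h>
    #include <infiniband/verbs.h>

    /* OMPI posts a whole set of these at init time; an incoming send consumes
     * the next available posted buffer. */
    int post_recv_buffer(struct ibv_qp *qp, struct ibv_mr *mr,
                         void *buf, uint32_t len)
    {
        struct ibv_sge sge = { .addr = (uintptr_t) buf,
                               .length = len, .lkey = mr->lkey };
        struct ibv_recv_wr wr, *bad_wr = NULL;
        memset(&wr, 0, sizeof(wr));
        wr.wr_id   = (uintptr_t) buf;  /* so the CQ entry tells us which buffer was filled */
        wr.sg_list = &sge;
        wr.num_sge = 1;

        return ibv_post_recv(qp, &wr, &bad_wr);
    }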

- Additionally, IB networks can utilize RDMA for message passing -- meaning that you don't send messages into pre-posted receive buffers, but rather give an address in the peer process to send the message directly to. This is called "RDMA semantics" for IB message passing (as opposed to "send / receive semantics"). There is some additional cost to this form of message passing because you have to exchange the target address from the receiver to the sender.
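
The RDMA-semantics version is the same ibv_post_send() call, just with an RDMA opcode plus the remote address and rkey that the target process sent you beforehand (a sketch only, with the same assumptions as above):

    #include <string.h>
    #include <stdint.h>
    #include <infiniband/verbs.h>

    /* Write "len" bytes from local registered memory directly into the peer's
     * buffer; no receive buffer is consumed on the peer.  "remote_addr" and
     * "rkey" must have been exchanged out of band -- the extra cost noted above. */
    int rdma_write(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, uint32_t len,
                   uint64_t remote_addr, uint32_t rkey)
    {
        struct ibv_sge sge = { .addr = (uintptr_t) buf,
                               .length = len, .lkey = mr->lkey };
        struct ibv_send_wr wr, *bad_wr = NULL;
        memset(&wr, 0, sizeof(wr));
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.opcode              = IBV_WR_RDMA_WRITE;
        wr.send_flags          = IBV_SEND_SIGNALED;
        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey        = rkey;

        return ibv_post_send(qp, &wr, &bad_wr);
    }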

- OMPI makes QPs lazily. That is, there is a bunch of code dedicated to coordinating and creating QPs when the first MPI_SEND is exchanged between a pair of MPI peer processes. Specifically, if you have an MPI process that calls MPI_INIT and MPI_FINALIZE (and no MPI_SEND/MPI_RECV functions), we won't make any QPs between MPI processes.

- The openib BTL generally does the following:
  - For short messages, RDMA is used for a limited set of peer processes (because RDMA consumes registered memory). Specifically, the first N processes that connect to a given process will be allowed to use RDMA for short messages.
  - For the N+1st (and beyond) peer that connects, send/receive semantics are used.
  - A complex protocol is used for long messages. It is described in this paper:
    http://www.open-mpi.org/papers/euro-pvmmpi-2006-hpc-protocols/

- Open MPI also implements the PERUSE statistics-gathering interface, which may be helpful to you. See this paper for details:
    http://www.open-mpi.org/papers/euro-pvmmpi-2006-peruse/

- This paper also describes some scalability issues with IB (particularly with consuming registered memory and something called shared receive queues [SRQ]):
    http://www.open-mpi.org/papers/ipdps-2006/

Hope this is helpful to you.

--
Jeff Squyres
Cisco Systems

Attachment: ibverbs-ofed-1.2.tar.gz

