On Apr 16, 2009, at 8:58 PM, pranav jadhav wrote:

> Thanks for providing the details. I was going through the code of MPI_Send and I found a function pointer being invoked: mca_pml.send of struct mca_pml_base_module_t. I am trying to figure out when these PML function pointers get initialized to call internal BTL functions.

There's a somewhat-complicated setup dance during MPI_INIT when all those function pointers get initialized. See below.

> I am trying to understand how an MPI program communicates over TCP/IP for message passing in a distributed setup, and would appreciate it if you could provide more details or any report that you would like to share.

The BTL (Byte Transfer Layer) is OMPI's lowest layer for point-to-point communications. The layering looks like this:

    MPI API
    PML (point-to-point messaging layer)
    BTL (byte transfer layer)

The PML also uses the BML (BTL multiplexing layer) to handle multiple BTLs simultaneously. I didn't include it in the layering above because it's just accounting functionality (arrays of BTL function pointers); it's not really a "layer" in the traditional sense.

BTW, note that the BTLs are only used by the OB1 and CSUM PMLs. There's a "dr" PML that is fairly dead at this point, and a CM PML which, for lack of a longer description, is used with different kinds of networks (not TCP). So let's focus on OB1 and the TCP BTL.

At the bottom of MPI_SEND (and the other point-to-point MPI API functions), you'll see a call to mca_pml.<foo>. This invokes a function in the selected PML -- in your case, OB1. OB1 handles all the MPI logic for point-to-point message passing: all the rules, matching, ordering, and progression. The BTLs are "simple" bit-pushers; they know nothing about MPI. They take fragments from the PML and send them to peers, and they receive fragments from peers and hand them up to the PML.

That's the 50,000-foot description.
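
To make that dispatch concrete, here's a deliberately tiny, hypothetical model of the idea -- not the real OMPI source (the real declarations live in ompi/mca/pml/pml.h, and the real MPI_Send in ompi/mpi/c/send.c); all the fake_* names are made up:

    /* Hypothetical sketch only -- not the real Open MPI code. */
    #include <stdio.h>
    #include <stddef.h>

    /* A PML module is essentially a struct of function pointers. */
    typedef struct pml_module {
        int (*pml_send)(const void *buf, size_t count, int dst, int tag);
        /* ...plus recv, isend, irecv, progress, etc. in the real struct */
    } pml_module_t;

    /* OB1's send implementation (stand-in). */
    static int ob1_send(const void *buf, size_t count, int dst, int tag)
    {
        (void)buf;
        printf("ob1: sending %zu bytes to rank %d (tag %d)\n", count, dst, tag);
        return 0;  /* the real OB1 would hand fragments to the BTLs here */
    }

    /* The one global PML instance; MPI_INIT points it at the winner. */
    static pml_module_t mca_pml;

    static void fake_mpi_init(void) { mca_pml.pml_send = ob1_send; }

    /* The MPI API call boils down to a dispatch through the pointer --
     * this is the mca_pml.send invocation you found. */
    static int fake_mpi_send(const void *buf, size_t count, int dst, int tag)
    {
        return mca_pml.pml_send(buf, count, dst, tag);
    }

    int main(void)
    {
        fake_mpi_init();
        return fake_mpi_send("hello", 5, 1, 42);
    }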

Most of the function pointers you care about are set up during MPI_INIT. There's a PML "selection" process that occurs: Open MPI queries every PML that it can find (e.g., those that were built as plugins) and asks, "do you want to run?" For each PML that answers yes, OMPI asks, "what's your priority?" and then selects the single willing PML with the highest priority. In your case, OB1 gets selected. All other PMLs are closed, and OB1's function pointers are loaded into the mca_pml struct. We then allow OB1 to initialize itself (since it "won" the selection process).
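
In code terms, the selection dance looks roughly like this -- a hypothetical sketch with made-up names, not the real logic in ompi/mca/pml/base/:

    /* Hypothetical sketch of PML selection -- not the real code. */
    #include <stdio.h>

    typedef struct pml_component {
        const char *name;
        int (*query)(int *priority);  /* returns 1 for "yes, I want to run" */
    } pml_component_t;

    static int ob1_query(int *prio) { *prio = 20; return 1; }
    static int cm_query(int *prio)  { *prio = 10; return 0; }  /* e.g., no suitable network */

    int main(void)
    {
        pml_component_t all[] = { { "ob1", ob1_query }, { "cm", cm_query } };
        pml_component_t *winner = NULL;
        int best = -1;

        /* Ask every PML "do you want to run?" and keep the willing
         * component with the highest priority; close all the others. */
        for (int i = 0; i < (int)(sizeof(all) / sizeof(all[0])); ++i) {
            int prio = 0;
            if (all[i].query(&prio) && prio > best) {
                best = prio;
                winner = &all[i];
            }
        }

        if (winner != NULL) {
            printf("selected PML: %s (priority %d)\n", winner->name, best);
            /* here the winner's function pointers would be copied into
             * the global mca_pml struct, and the winner initialized */
        }
        return 0;
    }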

Keep in mind that OB1 is an engine/state machine: it doesn't know how to connect to or communicate with peers. It uses the BTLs for that. So part of OB1's initialization is selecting which BTLs to use (unlike the PML selection, where exactly one PML is chosen at run time, OB1 uses as many BTLs as say "yes, I want to run"). OB1 uses the BML to manage the arrays of pointers to BTLs, but as I mentioned above, this is simple accounting/bookkeeping code -- if you look in the R2 BML module, it's just array manipulation. Pretty straightforward. So OB1 (via R2) opens up all the BTLs that it can find and queries each one: "do you want to run?" If a BTL answers yes, its function pointers get added to R2's internal store of pointers. We then let each BTL initialize itself (e.g., in TCP's case, opening a listening socket).
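
Here's the same kind of hypothetical sketch for the BTL side, showing the "just array manipulation" flavor of the BML's bookkeeping (again, made-up names; the real r2 code is in ompi/mca/bml/r2/):

    /* Hypothetical sketch of BTL selection / BML bookkeeping --
     * not the real code.  Unlike the PML (exactly one winner),
     * every BTL that answers "yes" is kept. */
    #include <stdio.h>
    #include <stddef.h>

    typedef struct btl_module {
        const char *name;
        int (*btl_init)(void);  /* e.g., TCP opens its listening socket */
    } btl_module_t;

    static int tcp_init(void) { printf("tcp: listening socket opened\n"); return 1; }
    static int sm_init(void)  { printf("sm: shared-memory segment mapped\n"); return 1; }

    #define MAX_BTLS 8
    static btl_module_t *used_btls[MAX_BTLS];  /* the BML's "array of pointers" */
    static size_t num_btls = 0;

    static void add_btl_if_willing(btl_module_t *btl, int willing)
    {
        if (willing && num_btls < MAX_BTLS) {
            used_btls[num_btls++] = btl;  /* just array manipulation */
            btl->btl_init();              /* let the BTL initialize itself */
        }
    }

    int main(void)
    {
        static btl_module_t tcp = { "tcp", tcp_init };
        static btl_module_t sm  = { "sm",  sm_init  };
        add_btl_if_willing(&tcp, 1);
        add_btl_if_willing(&sm,  1);
        printf("BML is multiplexing %zu BTLs\n", num_btls);
        return 0;
    }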

Open MPI's code tree is organized as follows:

ompi/ -- top-level directory for all MPI-related code
  mca/ -- top-level directory for all frameworks
    pml/ -- top-level directory for all pml components (plugins)
      base/ -- directory for pml "glue" code (i.e., shared between all pml plugins)
      ob1/ -- directory for all code in the ob1 component (plugin)
      cm/ -- directory for all code in the cm component
    bml/ -- top-level directory for all bml components
      base/ -- directory for bml "glue" code (i.e., shared between all bml plugins)
      r2/ -- directory for all code in the r2 component
    btl/ -- top-level directory for all btl components
      base/ -- directory for btl "glue" code (i.e., shared between all btl plugins)
      tcp/ -- directory for all code in the tcp component

...I think you can see the pattern here:

  ompi/mca/<framework name>/<component name>

where the <component name> of "base" is special: it's the "glue" code for that framework itself; it's not a component.

The interface for all plugins is always in a file of this form:

  ompi/mca/<framework>/<framework>.h

So look at ompi/mca/pml/pml.h and ompi/mca/btl/btl.h. We usually have a decent overview of the component interface in those files.
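
If it helps to have a mental model of what those headers declare, here's a hypothetical, simplified picture of the two kinds of structs you'll typically find in them (stand-in names, not the real declarations):

    /* Hypothetical sketch only -- not the real structs from pml.h or
     * btl.h.  The general pattern: a "component" is the plugin itself
     * (metadata plus selection entry points), and a "module" is what a
     * selected component hands back (the function pointers the upper
     * layers actually invoke). */
    #include <stddef.h>

    typedef struct example_component {
        const char *name;             /* e.g., "ob1" or "tcp" */
        int (*open)(void);            /* called when the plugin is loaded */
        int (*query)(int *priority);  /* the "do you want to run?" hook */
    } example_component_t;

    typedef struct example_module {
        int (*send)(const void *buf, size_t count, int dst, int tag);
        int (*recv)(void *buf, size_t count, int src, int tag);
        int (*finalize)(void);
    } example_module_t;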

That's the short answer of how OB1 and the BTLs start up and set up all their function pointers. :-)

As for collectives, that's a different framework (as opposed to PML, BML, and BTL): the coll framework. We have a bunch of different collective plugins available; which one(s) get used depends on several factors.

The coll selection process is significantly different from that of the PML (and the BTLs), meaning that it's a bit more complex... Have a look at ompi/mca/coll/coll.h for a description of how it works. Hopefully, with the background that I've listed above, you can read the comments in that file and have them make some semblance of sense...

--
Jeff Squyres
Cisco Systems
