On Apr 16, 2009, at 8:58 PM, pranav jadhav wrote:
> Thanks for providing the details. I was going through the code of
> MPI_Send and I found a function pointer being invoked, mca_pml.send
> of struct mca_pml_base_module_t. I am trying to figure out when
> these PML function pointers get initialized to call internal BTL
> functions.
There's a somewhat-complicated setup dance during MPI_INIT when all
those function pointers get initialized. See below.
> I am trying to understand how MPI programs communicate over TCP/IP
> for message passing in a distributed setup, and would appreciate it
> if you could provide more details or any report that you would like
> to share.
The BTL (Byte Transfer Layer) is OMPI's lowest layer for
point-to-point communications. The layering looks like this:
MPI API
PML (point-to-point messaging layer)
BTL (byte transfer layer)
The PML also uses the BML (BTL multiplexing layer) to handle multiple
BTLs simultaneously. I don't really list it in the layering above
because it's just accounting functionality (arrays of BTL function
pointers); it's not really a "layer" in the traditional sense.
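If it helps to see what "just accounting" means in practice, here's a
very rough sketch of the kind of per-peer bookkeeping the BML keeps.
The names below are made up for illustration; the real declarations
live in ompi/mca/bml/bml.h and ompi/mca/btl/btl.h and are more
elaborate than this.

#include <stddef.h>

/* Rough illustrative sketch only -- not the actual OMPI structures. */
struct sketch_btl_module;              /* stands in for a BTL's function-pointer table */

/* One entry per BTL that can reach a given peer */
struct sketch_bml_btl {
    struct sketch_btl_module *btl;     /* that BTL's function pointers */
    double                    weight;  /* fraction of traffic to route over it */
};

/* What the BML keeps for each peer process: plain arrays of the BTLs
   usable for eager (small) sends, larger sends, etc. */
struct sketch_bml_endpoint {
    struct sketch_bml_btl *btl_eager;  /* BTLs for small/eager messages */
    size_t                 num_eager;
    struct sketch_bml_btl *btl_send;   /* BTLs for larger sends */
    size_t                 num_send;
};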
BTW, know that the BTLs are only used by the OB1 and CSUM PMLs.
There's a "dr" PML which is fairly dead at this point, and a CM PML,
which, for lack of a longer description, is used with different kinds
of networks (not TCP). So let's focus on OB1 / the TCP BTL.
At the bottom of MPI_SEND (and other point-to-point MPI API
functions), you'll see a call to mca_pml.<foo>. This calls a function
in the selected PML -- in your case, OB1. OB1 handles all the MPI
logic for point-to-point message passing: all the rules, matching,
ordering, and progression for MPI point-to-point message passing. The
BTLs are "simple" bit-pushers. They know nothing about MPI. They
take fragments from the PML and send them to peers. They receive
fragments from peers and give them to the upper-level PML.
That's the 50k foot level description.
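For example, the dispatch at the bottom of MPI_Send boils down to
something like the following (paraphrased from memory, not the
verbatim source -- see ompi/mpi/c/send.c for the real thing):

/* Paraphrased sketch of the dispatch in ompi/mpi/c/send.c.  The
   MCA_PML_CALL() macro there simply expands to a call through a
   function pointer on the single global mca_pml struct -- the
   pointer that OB1 filled in during MPI_INIT. */
int MPI_Send(void *buf, int count, MPI_Datatype type,
             int dest, int tag, MPI_Comm comm)
{
    int rc;

    /* ... argument checking elided ... */

    rc = MCA_PML_CALL(send(buf, count, type, dest, tag,
                           MCA_PML_BASE_SEND_STANDARD, comm));

    /* ... error handling elided ... */
    return rc;
}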
Most of the function pointers you care about are set up during
MPI_INIT. There's a PML "selection" process that occurs -- Open MPI
queries every PML that it can find (e.g., those that were built as
plugins) and says "do you want to run?" If they answer yes, OMPI asks
them "what's your priority?" OMPI then selects the 1 PML that says
"yes, I want to run" with the highest priority. In your case, OB1 is
getting selected. All other PMLs are closed and OB1's function
pointers are loaded into the mca_pml struct. We then allow OB1 to
initialize itself (since it "won" the selection process).
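In rough pseudocode, the selection looks like this. This is an
illustrative sketch only -- the real logic lives under
ompi/mca/pml/base/, and the type and function names below are
invented for clarity:

#include <stddef.h>

/* Stand-in for a PML component: its init function answers "do you
   want to run?" and reports a priority. */
typedef struct {
    int (*init)(int *priority);
    /* ... the component's function-pointer table would follow ... */
} sketch_pml_component_t;

sketch_pml_component_t *select_pml(sketch_pml_component_t **found, size_t n)
{
    sketch_pml_component_t *best = NULL;
    int best_priority = -1;

    for (size_t i = 0; i < n; ++i) {
        int priority = -1;
        /* "do you want to run?" / "what's your priority?" */
        if (found[i]->init(&priority) == 0 && priority > best_priority) {
            best_priority = priority;
            best = found[i];           /* highest-priority "yes" wins */
        }
    }

    /* The winner's function pointers are copied into the global
       mca_pml struct; all other components are closed. */
    return best;
}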
Keep in mind that OB1 is an engine/state machine: it doesn't know how
to connect to or communicate with peers. It uses the BTLs for that.
So part of OB1's initialization is selecting which BTLs to use (unlike
the PML, where we only choose *1* PML to use at run-time, OB1 chooses
as many BTLs as say "yes, I want to run"). OB1 uses the BML to manage
the arrays of pointers to BTLs, but as I mentioned above, this is
simple accounting/bookkeeping code -- if you look in the R2 BML
module, it's just array manipulation stuff. Pretty straightforward.
So OB1 (R2) opens up all the BTLs that it can find and queries them
"do you want to run?" If the BTL answers "yes", then its function
pointers get added to R2's internal store of pointers. We then let
each BTL initialize itself (e.g., in TCP's case, open up a listening
socket).
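As a concrete example of that last step, the TCP BTL's initialization
is ultimately just ordinary socket code, something along these lines.
This is a simplified sketch; the real code in ompi/mca/btl/tcp/
handles multiple interfaces, IPv6, non-blocking I/O, and much more:

#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>
#include <unistd.h>

/* Open a listening socket on an ephemeral port; the resulting
   (address, port) pair is what gets advertised to peers so they can
   connect back later. */
static int sketch_tcp_btl_listen(int *out_fd, struct sockaddr_in *out_addr)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family      = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port        = 0;          /* let the kernel pick a port */

    socklen_t len = sizeof(addr);
    if (bind(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0 ||
        listen(fd, 64) < 0 ||
        getsockname(fd, (struct sockaddr *) &addr, &len) < 0) {
        close(fd);
        return -1;
    }

    *out_fd   = fd;
    *out_addr = addr;
    return 0;
}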
Open MPI's code tree is organized as follows:
ompi/ -- top-level directory for all MPI-related code
  mca/ -- top-level directory for all frameworks
    pml/ -- top-level directory for all pml components (plugins)
      base/ -- pml "glue" code (i.e., shared between all pml plugins)
      ob1/ -- all code in the ob1 component (plugin)
      cm/ -- all code in the cm component
    bml/ -- top-level directory for all bml components
      base/ -- bml "glue" code (i.e., shared between all bml plugins)
      r2/ -- the r2 component
    btl/ -- top-level directory for all btl components
      base/ -- btl "glue" code (i.e., shared between all btl plugins)
      tcp/ -- the tcp component
...I think you can see the pattern here:
ompi/mca/<framework name>/<component name>
where the <component name> of "base" is special: it's the "glue" code
for that framework itself; it's not a component.
The interface for all plugins is always in a file of this form:
ompi/mca/<framework>/<framework>.h
So look at ompi/mca/pml/pml.h and ompi/mca/btl/btl.h. We usually have
a decent overview of the component interface in those files.
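To give you a flavor of what's in there: btl.h declares the module
struct that every BTL fills in with its own function pointers,
roughly along these lines. This is a paraphrase -- the field names
and signatures below are illustrative, so consult the real header
for the exact interface:

#include <stddef.h>

/* Paraphrased flavor of a BTL module -- not the verbatim declarations
   from ompi/mca/btl/btl.h.  Each BTL (tcp, sm, openib, ...) fills in
   a struct like this with its own functions. */
struct sketch_btl_module {
    size_t eager_limit;                    /* max size of an "eager" send */

    /* peer bookkeeping: told which peers it can reach */
    int (*add_procs)(void *peers, size_t nprocs);

    /* fragment management */
    void *(*alloc_frag)(size_t size);      /* get a send fragment */
    int   (*free_frag)(void *frag);        /* return a fragment   */

    /* the actual bit-pushing: send a fragment to a peer; completion
       is signaled later via a callback up into the PML */
    int (*send_frag)(void *endpoint, void *frag, int tag);
};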
That's the short answer of how OB1 and the BTLs start up and set up
all their function pointers. :-)
As for collectives, that's a different framework (e.g., as opposed to
PML, BML, BTL): the coll framework. We have a bunch of different
collective plugins available; which one(s) is(are) used depends on
several factors.
The coll selection process is significantly different from that of
the PML (e.g., OB1 and the BTLs), meaning that it's a bit more complex...
Have a look in ompi/mca/coll/coll.h for a description of how that
works. Hopefully, with the background that I've listed above, you can
read the comments in that file and have it make some semblance of
sense...
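The end result is similar in spirit to the PML case, though: each
communicator ends up carrying a table of collective function
pointers, and the MPI API just calls through it. Roughly (paraphrased
from memory; the exact field names are in ompi/mca/coll/coll.h and
the communicator code):

/* Paraphrased sketch of how a collective call dispatches -- not the
   verbatim OMPI source.  Each communicator carries a c_coll table of
   function pointers filled in by whichever coll component(s) were
   selected for that communicator when it was created. */
int MPI_Bcast(void *buf, int count, MPI_Datatype type, int root, MPI_Comm comm)
{
    /* ... argument checking elided ... */
    return comm->c_coll.coll_bcast(buf, count, type, root, comm,
                                   comm->c_coll.coll_bcast_module);
}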
--
Jeff Squyres
Cisco Systems