Re: [OMPI devel] RFC: revamp topo framework

Jeff Squyres Tue, 3 Nov 2009 06:17:34 -0500

On Nov 3, 2009, at 3:40 AM, Luigi Scorzato wrote:

This defines the precise relation between ranks and coordinates. Once
I know this, I do not even need to write a topo component, because I
can define the ranks of my computing nodes in a rankfile in order
that they get the coordinates that they need physically.

Fair enough. A topo component would make it unnecessary to lay out
your processes in a specific order because it could (hypothetically)
understand your physical topology and re-order the ranks accordingly.
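
To make the rank/coordinate relation concrete, here is a minimal sketch. It assumes the row-major numbering that MPI Cartesian topologies use (last dimension varies fastest); the function names are made up for illustration and are not part of Open MPI or of any topo component.

```c
#include <assert.h>

/* Hypothetical helper: convert a rank into Cartesian coordinates,
 * assuming row-major order (the last dimension varies fastest). */
void rank_to_coords(int rank, int ndims, const int dims[], int coords[])
{
    assert(rank >= 0 && ndims > 0);
    for (int d = ndims - 1; d >= 0; --d) {
        assert(dims[d] > 0);
        coords[d] = rank % dims[d];
        rank /= dims[d];
    }
}

/* Hypothetical helper: the inverse mapping, coordinates back to a rank. */
int coords_to_rank(int ndims, const int dims[], const int coords[])
{
    int rank = 0;
    for (int d = 0; d < ndims; ++d) {
        assert(coords[d] >= 0 && coords[d] < dims[d]);
        rank = rank * dims[d] + coords[d];
    }
    return rank;
}
```

With a mapping like this fixed, a rankfile can place each rank on the node whose physical position matches its coordinates; a topo component could in principle work the other way around and pick the ranks to fit the hardware.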

A different issue is the BTL component. This is actually where my
approach 1 and 2 differ (my previous distinction was confusing, due
to my lack of understanding of the distinction between topo and btl
components).

In the 1st approach I would redefine some crucial (for my code) MPI
functions in a way that they call the low level torus primitives,
when the communication occurs between nearest neighbors, and fall
back to open-mpi functions otherwise.
The 2nd approach would be to develop our torus-btl. The fact that one
can choose a "priority list of networks" is definitely great and
dissipates my worries about the feasibility of the 2nd approach in my
case. The only remaining question is whether I can get familiar with
btl stuff fast enough. What would you suggest I read in order to
learn quickly how to create a BTL component?
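
The decision at the heart of the 1st approach (torus primitive for nearest neighbors, fallback otherwise) might look roughly like the sketch below. It assumes a fully periodic torus and coordinates in the range [0, dims[d]); the names `torus_nearest_neighbors`, `choose_path`, `PATH_TORUS` and `PATH_FALLBACK` are hypothetical and only illustrate the dispatch, not any real torus or Open MPI API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical predicate: true if a and b differ by exactly one step
 * (with wraparound) in exactly one dimension of a periodic torus. */
bool torus_nearest_neighbors(int ndims, const int dims[],
                             const int a[], const int b[])
{
    int steps = 0;
    for (int d = 0; d < ndims; ++d) {
        assert(dims[d] > 0);
        /* Distance along dimension d, folded into [0, dims[d]). */
        int diff = (a[d] - b[d] + dims[d]) % dims[d];
        if (diff == 0)
            continue;
        if (diff == 1 || diff == dims[d] - 1)
            ++steps;            /* one hop, possibly across the wraparound */
        else
            return false;       /* more than one hop in this dimension */
    }
    return steps == 1;
}

/* Hypothetical dispatch result: which code path a send should take. */
enum send_path { PATH_TORUS, PATH_FALLBACK };

enum send_path choose_path(int ndims, const int dims[],
                           const int self[], const int peer[])
{
    return torus_nearest_neighbors(ndims, dims, self, peer)
               ? PATH_TORUS
               : PATH_FALLBACK;
}
```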

The BTL is a bit more complicated than topo -- topo is actually pretty
straightforward. BTL is a dumb byte-pusher that is controlled by an
upper-level framework: the Point-to-point Messaging Layer (PML). The
PML effects the semantics of the MPI point-to-point communications;
PML components are the back-ends to MPI_SEND and friends. The PML
initializes BTLs during MPI_INIT and builds up the priority lists of
networks, etc. Then during MPI_SEND (etc.), the PML uses this
information to decide what to do with messages -- fragment them over
multiple BTLs, etc. It then calls the BTL modules in question to
actually do the send. On receive, the BTLs make upcalls to the PML
saying "here's a fragment; you handle it".

Hence, in this way, the BTLs are dumb byte pushers -- they simply send
and receive to individual peers (without any MPI semantics at all) and
give all the fragments they receive to the PML, who then effects all
the MPI semantics.
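
As a rough illustration of that division of labor, here is a toy sketch. Every name in it (`toy_btl_t`, `toy_pml_t`, `toy_pml_send`, and so on) is invented for this example; the real interfaces are much richer. The "BTLs" are loopback byte pushers that hand each fragment straight back through an upcall, and the "PML" stripes a message over them round-robin and reassembles it on receive.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical upcall a toy BTL makes when a fragment arrives. */
typedef void (*toy_recv_cb_t)(void *pml_ctx, size_t offset,
                              const void *data, size_t len);

/* Hypothetical toy BTL: knows nothing about messages, only fragments. */
typedef struct toy_btl {
    const char   *name;
    size_t        max_frag;    /* largest fragment this network carries */
    size_t        bytes_sent;  /* bookkeeping for the example only */
    toy_recv_cb_t recv_cb;     /* upcall into the PML */
    void         *pml_ctx;
} toy_btl_t;

/* Hypothetical toy PML state: owns the message-level view. */
typedef struct toy_pml {
    char  *recv_buf;
    size_t recv_cap;
    size_t bytes_received;
} toy_pml_t;

/* Loopback "send": the fragment arrives immediately and the BTL
 * tells the PML "here's a fragment; you handle it". */
void toy_btl_send(toy_btl_t *btl, size_t offset, const void *data, size_t len)
{
    assert(len <= btl->max_frag);
    btl->bytes_sent += len;
    btl->recv_cb(btl->pml_ctx, offset, data, len);
}

/* PML-side receive upcall: place the fragment where it belongs. */
void toy_pml_recv_frag(void *ctx, size_t offset, const void *data, size_t len)
{
    toy_pml_t *pml = ctx;
    assert(offset + len <= pml->recv_cap);
    memcpy(pml->recv_buf + offset, data, len);
    pml->bytes_received += len;
}

/* PML-side send: fragment the message round-robin over the BTLs. */
void toy_pml_send(toy_btl_t *btls, size_t nbtls, const void *msg, size_t len)
{
    const char *p = msg;
    size_t offset = 0, i = 0;
    assert(nbtls > 0);
    while (offset < len) {
        toy_btl_t *btl = &btls[i++ % nbtls];
        size_t left = len - offset;
        size_t frag = left < btl->max_frag ? left : btl->max_frag;
        assert(frag > 0);   /* a zero max_frag would never make progress */
        toy_btl_send(btl, offset, p + offset, frag);
        offset += frag;
    }
}
```

Note that only the toy PML knows where a fragment sits in the message; the toy BTL just moves bytes and reports them upward.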

Read ompi/mca/btl/btl.h and ompi/mca/pml/pml.h for the details of the
interfaces.

Are the network primitives of your network like TCP (reads and writes
can partially complete), or are they like Myrinet / IB (messages are
read and written discretely, potentially also starting reads and
writes and later receiving completion calls indicating that they
finished)?
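
To show why the distinction matters, here is a small sketch of the TCP-like case. It uses POSIX write() on an ordinary file descriptor as a stand-in (not any torus API), and `write_all` is a hypothetical helper name. Because a write may complete only partially, the caller has to loop until every byte is out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: keep calling write() until all len bytes are
 * written. Returns 0 on success, -1 on error (errno is left set). */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    assert(buf != NULL || len == 0);
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before writing; retry */
            return -1;
        }
        p   += n;               /* partial completion: advance and loop */
        len -= (size_t)n;
    }
    return 0;
}
```

With discrete-message primitives there is no such loop: a message is handed over whole, and the only thing left to track is whether its completion has been reported yet.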


--
Jeff Squyres
jsquy...@cisco.com
