Re: [OMPI devel] [OMPI users] How can I tell (open-)mpi about the HW topology of my system?

Jeff Squyres Tue, 27 Oct 2009 19:58:21 -0400

On Oct 27, 2009, at 6:12 AM, Luigi Scorzato wrote:

> However, we do have the good foresight (if I do say so myself ;-) ) to
> make the MPI topology system be a plugin in Open MPI. The only plugin
> for this system is currently the "do nothing" plugin, but it would
> *not* be difficult to write one that actually did something meaningful
> in your torus.
> If you're interested, I'd be happy to explain how to do it (and we
> should probably move to the devel list). OMPI doesn't require too
> much framework code; I would guess that the majority of the code would
> actually be implementing whatever algorithms you wanted for your
> torus. Heck, you could even write a blind-and-dumb algorithm that
> just looks up tables in files based on hostnames in your torus.

I am very much interested. Could you please suggest where I should
look?



(moved to devel from users list)

Open MPI has two entities that you need to know about: frameworks and components (components are also referred to as "plugins"). Frameworks are the glue for a specific kind of component (plugin). For example, we have a framework for MPI point-to-point messages. We have another framework for MPI collective operations. We have another framework (the one you care about) for MPI topology operations. And so on. In each framework, there's one or more components (plugins) that are loaded and used at run-time to effect the functionality in that framework.

Example: one of the MPI point-to-point messaging frameworks is called the "BTL" (byte transfer layer). We have a bunch of BTL components: one for TCP, one for shared memory, one for process loopback, one for MX, one for OpenFabrics verbs, ...etc. These plugins are effectively (eventually) called when you call MPI_SEND, MPI_RECV, ...etc.

Example: another MPI framework is "coll" -- MPI collective operations. We have several components that effect different algorithms and transports underneath. These plugins are called when you call MPI_BARRIER, MPI_BCAST, MPI_SCATTER, ...etc.

Example: the "topo" MPI framework is for MPI topology operations. We currently only have one component in this framework, named "unity" (because it makes no transformation of ranks). The functions in these components are called when you call MPI_CART_CREATE, MPI_GRAPH_CREATE, ...etc.

Frameworks can be found in the OMPI source code in ompi/mca/<framework>. There's always a header file named ompi/mca/<framework>/<framework>.h. Components are always specific to a single framework, and can be found in the OMPI source code in ompi/mca/<framework>/<component>.

So you want to make a new topo component that can remap ranks based on your network topology, perhaps in ompi/mca/topo/luigi/ or ompi/mca/topo/torus/ or whatever.


See these wiki pages:

  https://svn.open-mpi.org/trac/ompi/wiki/devel/CreateFramework
  --> will give you an appreciation of what frameworks are

  https://svn.open-mpi.org/trac/ompi/wiki/devel/CreateComponent
  --> step-by-step instructions on how to make a new luigi or torus or whatever component

I would suggest getting an SVN checkout of the OMPI trunk (see http://www.open-mpi.org/svn/) and working on your new component there.

The file ompi/mca/topo/topo.h has a decent description of the topo component interface (i.e., the functions that your new component will need to provide). Note that the MPI cartesian and graph communicator interfaces were cleverly designed such that all the cart functions can be implemented in terms of MPI_CART_MAP and all the graph functions can be implemented in terms of MPI_GRAPH_MAP. So aside from OMPI "glue" code, your plugins may only need to provide those two functions to be fully functional.
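To make the "everything reduces to a map function" idea concrete, here is a minimal standalone sketch. It does not use the real OMPI or MPI interfaces: every name prefixed with sketch_ is hypothetical, SKETCH_UNDEFINED merely stands in for MPI_UNDEFINED, and the identity mapping is only an assumption about what a "no transformation" component would do. Once a map function has decided each process's new rank, coordinates and torus neighbors follow from row-major arithmetic on the grid dimensions.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_UNDEFINED (-1)  /* hypothetical stand-in for MPI_UNDEFINED */

/* Hypothetical cart_map: identity mapping (no transformation of ranks).
 * Processes that do not fit in the grid get SKETCH_UNDEFINED. */
int sketch_cart_map(int old_rank, int ndims, const int *dims)
{
    int nprocs = 1;
    assert(ndims > 0 && dims != NULL);
    for (int i = 0; i < ndims; ++i) {
        assert(dims[i] > 0);
        nprocs *= dims[i];
    }
    return (old_rank >= 0 && old_rank < nprocs) ? old_rank : SKETCH_UNDEFINED;
}

/* Rank -> coordinates, row-major (last dimension varies fastest). */
void sketch_cart_coords(int rank, int ndims, const int *dims, int *coords)
{
    for (int i = ndims - 1; i >= 0; --i) {
        coords[i] = rank % dims[i];
        rank /= dims[i];
    }
}

/* Coordinates -> rank, wrapping every dimension periodically (a torus). */
int sketch_torus_rank(int ndims, const int *dims, const int *coords)
{
    int rank = 0;
    for (int i = 0; i < ndims; ++i) {
        /* Normalize negative or too-large coordinates into [0, dims[i]). */
        int c = ((coords[i] % dims[i]) + dims[i]) % dims[i];
        rank = rank * dims[i] + c;
    }
    return rank;
}
```

A topology-aware component would replace the identity in sketch_cart_map with a lookup that places neighboring grid coordinates on neighboring torus nodes; the two helper functions would not need to change.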

I'd advise using the unity component as an example to create a new component, and then fill in whatever algorithms you want.

Some more OMPI terminology: a "module" is an "instance" of a component. Think of a "component" as a C++ class; think of a "module" as a C++ object. The "base" is the glue of a framework that makes it run (e.g., the functions for opening the framework, traversing found components, closing the framework, etc.).

The basic startup sequence is that OMPI will call the init_query function on your component the first time MPI_CART_CREATE or MPI_GRAPH_CREATE is invoked and see if it wants to run. If it does, the component is added to a list of "available" components.

Every time a graph or cart communicator is created, the list of available topo components is traversed and the component comm_query function is invoked. The comm_query function indicates whether it can be used or not by returning a module or a NULL. The base maintains a list of modules that were returned and selects the one with the highest priority. comm_unquery is called on all the losers; module_init is invoked on the winner.
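The per-communicator selection described above can be sketched as a small standalone simulation. None of this is the real OMPI code: the sketch_ types and functions are hypothetical, the "willing" flag and fixed priority stand in for whatever a real comm_query would decide, and the loop tracks the best candidate in a single pass instead of building a list first (the outcome is the same: one winner, everyone else unqueried).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical module: one instance handed out by a component. */
typedef struct {
    const char *owner;   /* name of the component that produced it */
    int initialized;     /* set when the module wins selection */
} sketch_module_t;

/* Hypothetical component: decides per communicator whether it can run. */
typedef struct {
    const char *name;
    int willing;         /* would comm_query return a module? */
    int priority;        /* priority reported by comm_query */
    int unqueried;       /* set when comm_unquery was called on it */
    sketch_module_t module;
} sketch_component_t;

/* Returns a module (and its priority) or NULL if the component declines. */
sketch_module_t *sketch_comm_query(sketch_component_t *c, int *priority)
{
    if (!c->willing) {
        return NULL;
    }
    c->module.owner = c->name;
    c->module.initialized = 0;
    *priority = c->priority;
    return &c->module;
}

/* Tells a losing component to release what it set up in comm_query. */
void sketch_comm_unquery(sketch_component_t *c, sketch_module_t *m)
{
    (void)m;
    c->unqueried = 1;
}

/* Query every available component, keep the highest priority module,
 * unquery the losers, and initialize the winner. */
sketch_module_t *sketch_comm_select(sketch_component_t *avail, size_t count)
{
    sketch_component_t *winner = NULL;
    sketch_module_t *best = NULL;
    int best_priority = 0;

    for (size_t i = 0; i < count; ++i) {
        int priority = 0;
        sketch_module_t *m = sketch_comm_query(&avail[i], &priority);
        if (m == NULL) {
            continue;                      /* component declined */
        }
        if (best == NULL || priority > best_priority) {
            if (winner != NULL) {
                sketch_comm_unquery(winner, best);  /* previous best loses */
            }
            winner = &avail[i];
            best = m;
            best_priority = priority;
        } else {
            sketch_comm_unquery(&avail[i], m);
        }
    }
    if (best != NULL) {
        best->initialized = 1;             /* stands in for module_init */
    }
    return best;
}
```

With three components where "unity" has priority 10, "torus" has priority 50, and a third one declines, the sketch returns the torus module, unqueries unity, and never touches the component that returned NULL.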

Check out the code in ompi/mca/topo/base/topo_base_comm_select.c -- there's a good amount of comments in there about how per-communicator selection occurs.

--> Hmm. I'm looking at the prototype for comm_query in topo.h and it doesn't take a list of processes. This seems like a bad idea; a component may only be able to run on a subset of processes in the overall MPI job (e.g., if you have a shared-memory topology component, it would only allow itself to be used at run-time if all processes in the communicator are physically located on the same node). Hmm. We might want to update this prototype to include a list of processes that you can check to see if your component is eligible. Additionally, it seems weird that the comm_unquery function is on the component -- it really should be on the module (editor's note: this framework was created way back during the beginning of OMPI and likely hasn't been touched since... I think it's showing its age :-\ ).
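As an illustration of the kind of eligibility check a process list would enable, here is a minimal sketch. It is purely hypothetical: the function name and the idea of passing one hostname string per process are assumptions made for this example, not the current or planned comm_query prototype.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical check for a shared-memory topology component: returns 1
 * if every process in the communicator reports the same hostname,
 * 0 otherwise. An empty or single-process list counts as eligible. */
int sketch_all_on_one_node(const char *const *hostnames, size_t nprocs)
{
    assert(hostnames != NULL);
    for (size_t i = 1; i < nprocs; ++i) {
        if (strcmp(hostnames[0], hostnames[i]) != 0) {
            return 0;   /* at least one process lives on another node */
        }
    }
    return 1;
}
```

A component built this way would return NULL from its query function whenever the check fails, leaving the communicator to some other component.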

Once a module is selected, its function pointers effectively become the back-ends to functions like MPI_CART_CREATE, MPI_GRAPH_CREATE, etc. Note that you can implement all the topology functions in terms of MPI_CART_MAP and MPI_GRAPH_MAP (this is what unity does). If you provide NULL for all the other function pointers, the base will automatically insert functions that implement themselves by calling your module's cart_map and graph_map functions.
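The NULL-pointer fallback can be sketched as follows. This is a standalone illustration, not the real module struct from topo.h: the sketch_ names, the two-field struct, and the rank-reversing map are all hypothetical, and the "default" cart_create only returns the remapped rank instead of building a communicator.

```c
#include <assert.h>
#include <stddef.h>

typedef struct sketch_topo_module sketch_topo_module_t;

/* Hypothetical module: cart_map is mandatory, cart_create is optional. */
struct sketch_topo_module {
    int (*cart_map)(int old_rank, int ndims, const int *dims);
    int (*cart_create)(const sketch_topo_module_t *self,
                       int old_rank, int ndims, const int *dims);
};

/* Default supplied by the base: implemented purely via the module's
 * own cart_map function. */
static int sketch_base_cart_create(const sketch_topo_module_t *self,
                                   int old_rank, int ndims, const int *dims)
{
    return self->cart_map(old_rank, ndims, dims);
}

/* Base-side fixup: any optional pointer left NULL gets the default. */
void sketch_base_fill_defaults(sketch_topo_module_t *module)
{
    assert(module != NULL && module->cart_map != NULL);
    if (module->cart_create == NULL) {
        module->cart_create = sketch_base_cart_create;
    }
}

/* Example component-provided map: reverses the rank order in the grid. */
int sketch_reverse_map(int old_rank, int ndims, const int *dims)
{
    int nprocs = 1;
    for (int i = 0; i < ndims; ++i) {
        nprocs *= dims[i];
    }
    return (old_rank >= 0 && old_rank < nprocs) ? nprocs - 1 - old_rank : -1;
}
```

A module that sets only cart_map to sketch_reverse_map and leaves cart_create NULL ends up, after sketch_base_fill_defaults, with a working cart_create that routes through its own map.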

Note that in order to save some space, we overlap the meanings of some fields (graph dimensions or list of indexes). In hindsight, I'm not sure why we didn't use a union. :-\
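For comparison, a tagged union makes the two meanings explicit. The sketch below is hypothetical and does not mirror the actual OMPI structure or its field names; it only shows one way the "dimensions or index list" overlap could be expressed.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { SKETCH_TOPO_CART, SKETCH_TOPO_GRAPH } sketch_topo_kind_t;

/* Hypothetical topology descriptor: "kind" says which union member
 * is valid, so the overlap is visible in the type itself. */
typedef struct {
    sketch_topo_kind_t kind;
    int count;               /* cart: number of dims; graph: number of nodes */
    union {
        const int *dims;     /* cart: size of each dimension */
        const int *index;    /* graph: index array, one entry per node */
    } u;
} sketch_topo_info_t;

/* Number of processes the topology describes. */
int sketch_topo_size(const sketch_topo_info_t *info)
{
    assert(info != NULL);
    if (info->kind == SKETCH_TOPO_CART) {
        int n = 1;
        for (int i = 0; i < info->count; ++i) {
            n *= info->u.dims[i];
        }
        return n;
    }
    return info->count;      /* graph: one process per node */
}
```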

Finally, when the communicator is destroyed, the module_finalize function is invoked.


=====

Based on my "Hmm..." comment above, I think I want to revamp the selection logic a little before you dive too deeply into this -- to modernize it and make it a bit more like the rest of the OMPI code base; you can tell that this code was created a long time ago and then hasn't been touched since (you're the first person to express interest in creating a real topo component! :-) ). I've created a Mercurial branch of the OMPI trunk for this work and published it here:


    http://bitbucket.org/jsquyres/ompi-topo-fixes/

Give me a few days to get this branch into shape (and potentially to get it back to the SVN trunk). I might even get inspired to make a template 2nd component for you (i.e., I might need a 2nd component just to ensure that the selection logic is working :-) ).


--
Jeff Squyres
jsquy...@cisco.com
