On Oct 27, 2009, at 6:12 AM, Luigi Scorzato wrote:

> However, we do have the good foresight (if I do say so myself ;-) ) to
> make the MPI topology system be a plugin in Open MPI. The only plugin
> for this system is currently the "do nothing" plugin, but it would
> *not* be difficult to write one that actually did something meaningful
> in your torus.
>
> If you're interested, I'd be happy to explain how to do it (and we
> should probably move to the devel list). OMPI doesn't require too
> much framework code; I would guess that the majority of the code would
> actually be implementing whatever algorithms you wanted for your
> torus. Heck, you could even write a blind-and-dumb algorithm that
> just looks up tables in files based on hostnames in your torus.

I am very much interested. Could you please suggest me where I should
look into?



(moved to devel from users list)

Open MPI has two entities that you need to know about: frameworks and components (components are also referred to as "plugins"). Frameworks are the glue for a specific kind of component (plugin). For example, we have a framework for MPI point-to-point messages. We have another framework for MPI collective operations. We have another framework (the one you care about) for MPI topology operations. And so on. In each framework, there's one or more components (plugins) that are loaded and used at run-time to effect the functionality in that framework.

Example: one of the MPI point-to-point messaging frameworks is called the "BTL" (byte transfer layer). We have a bunch of BTL components: one for TCP, one for shared memory, one for process loopback, one for MX, one for OpenFabrics verbs, ...etc. These plugins are effectively (eventually) called when you call MPI_SEND, MPI_RECV, ...etc.

Example: another MPI framework is "coll" -- MPI collective operations. We have several components that effect different algorithms and transports underneath. These plugins are called when you call MPI_BARRIER, MPI_BCAST, MPI_SCATTER, ...etc.

Example: the "topo" MPI framework is for MPI topology operations. We currently only have one component in this framework, named "unity" (because it makes no transformation of ranks). The functions in these components are called when you call MPI_CART_CREATE, MPI_GRAPH_CREATE, ...etc.

Frameworks can be found in the OMPI source code in ompi/mca/<framework>. There's always a header file named ompi/mca/<framework>/<framework>.h. Components are always specific to a single framework, and can be found in the OMPI source code in ompi/mca/<framework>/<component>.

So you want to make a new topo component that can remap ranks based on your network topology, perhaps in ompi/mca/topo/luigi/ or ompi/mca/topo/torus/ or whatever.

See these wiki pages:

  https://svn.open-mpi.org/trac/ompi/wiki/devel/CreateFramework
  --> will give you an appreciation of what frameworks are
  https://svn.open-mpi.org/trac/ompi/wiki/devel/CreateComponent
  --> step-by-step instructions on how to make a new luigi or torus or whatever component

I would suggest getting an SVN checkout of the OMPI trunk (see http://www.open-mpi.org/svn/) and working on your new component there.

The file ompi/mca/topo/topo.h has a decent description of the topo component interface (i.e., the functions that your new component will need to provide). Note that the MPI cartesian and graph communicator interfaces were cleverly designed such that all the cart functions can be implemented in terms of MPI_CART_MAP and all the graph functions can be implemented in terms of MPI_GRAPH_MAP. So aside from OMPI "glue" code, your plugin may only need to provide those two functions to be fully functional.
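
To give a feel for the arithmetic a torus-aware map function might end up doing, here is a minimal sketch in plain C. It is not the topo.h interface: the function names are made up for illustration, and it assumes you already know each process's physical coordinates in the torus (e.g., from a hostname table). It only shows the row-major conversion between coordinates and ranks, which is the numbering MPI uses for cartesian communicators.

```c
#include <assert.h>

/* Hypothetical helper (not an Open MPI function): row-major rank of the
 * process located at coords[] in a grid with extents dims[]. */
int torus_coords_to_rank(int ndims, const int dims[], const int coords[])
{
    int rank = 0;
    for (int i = 0; i < ndims; ++i) {
        /* Each coordinate must lie inside its dimension. */
        assert(coords[i] >= 0 && coords[i] < dims[i]);
        rank = rank * dims[i] + coords[i];
    }
    return rank;
}

/* Hypothetical inverse: recover coords[] from a row-major rank. */
void torus_rank_to_coords(int ndims, const int dims[], int rank, int coords[])
{
    for (int i = ndims - 1; i >= 0; --i) {
        coords[i] = rank % dims[i];
        rank /= dims[i];
    }
}
```

A real cart_map would return, for the calling process, the rank computed from its physical position, so that neighbors in the cartesian communicator are also neighbors in the torus.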

I'd advise using the unity component as an example to create a new component, and then fill in whatever algorithms you want.

Some more OMPI terminology: a "module" is an "instance" of a component. Think of a "component" as a C++ class; think of a "module" as a C++ object. The "base" is the glue of a framework that makes it run (e.g., the functions for opening the framework, traversing found components, closing the framework, etc.).
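
In C terms, the component/module split looks roughly like the sketch below. The struct and function names are invented for illustration (the real types are in ompi/mca/topo/topo.h and look different); the point is only that there is one component, and it hands out a fresh module per communicator.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical module: per-communicator state (the "object"). */
struct demo_module {
    int priority;
    int comm_id;   /* stands in for a real communicator handle */
};

/* Hypothetical component: one per plugin (the "class"). */
struct demo_component {
    const char *name;
    struct demo_module *(*comm_query)(int comm_id);
};

/* Creates a new module for the given communicator, or NULL on failure. */
struct demo_module *demo_comm_query(int comm_id)
{
    struct demo_module *m = malloc(sizeof(*m));
    if (m != NULL) {
        m->priority = 10;
        m->comm_id = comm_id;
    }
    return m;
}

/* The single component instance for this made-up "torus" plugin. */
const struct demo_component demo_torus_component = { "torus", demo_comm_query };
```

Calling demo_torus_component.comm_query() twice yields two independent modules, one per communicator, which the caller is responsible for freeing.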

The basic startup sequence is that OMPI will call the init_query function on your component the first time MPI_CART_CREATE or MPI_GRAPH_CREATE is invoked and see if it wants to run. If it does, the component is added to a list of "available" components.

Every time a graph or cart communicator is created, the list of available topo components is traversed and each component's comm_query function is invoked. The comm_query function indicates whether the component can be used or not by returning a module or NULL. The base maintains a list of modules that were returned and selects the one with the highest priority. comm_unquery is called on all the losers; module_init is invoked on the winner.
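
The selection pass can be sketched as follows. This is a simplified stand-in, not the code in topo_base_comm_select.c: the types and names are hypothetical, it keeps a running best instead of building a list, and setting a flag stands in for calling module_init.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical module: flags record what the "base" did to it. */
struct sk_module { int priority; int initialized; int unqueried; };

/* Hypothetical component with the two per-communicator hooks. */
struct sk_component {
    struct sk_module *(*comm_query)(void);      /* NULL means "cannot run" */
    void (*comm_unquery)(struct sk_module *m);  /* called on each loser */
};

/* Query every component, keep the highest priority, unquery the rest. */
struct sk_module *sk_select(const struct sk_component *comps, size_t n)
{
    struct sk_module *best = NULL;
    const struct sk_component *best_comp = NULL;
    for (size_t i = 0; i < n; ++i) {
        struct sk_module *m = comps[i].comm_query();
        if (m == NULL) continue;                 /* component opted out */
        if (best == NULL || m->priority > best->priority) {
            if (best != NULL) best_comp->comm_unquery(best);
            best = m;
            best_comp = &comps[i];
        } else {
            comps[i].comm_unquery(m);
        }
    }
    if (best != NULL) best->initialized = 1;     /* stands in for module_init */
    return best;
}

/* Three toy components used to exercise the selection. */
struct sk_module sk_low = {10, 0, 0}, sk_high = {50, 0, 0};
struct sk_module *sk_query_low(void)  { return &sk_low; }
struct sk_module *sk_query_high(void) { return &sk_high; }
struct sk_module *sk_query_none(void) { return NULL; }
void sk_mark_unqueried(struct sk_module *m) { m->unqueried = 1; }
```

With the three toy components above, sk_select() returns the priority-50 module and marks the priority-10 one as unqueried; on a priority tie this sketch keeps the first one it saw, which is an arbitrary choice.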

Check out the code in ompi/mca/topo/base/topo_base_comm_select.c -- there's a good amount of comments in there about how per-communicator selection occurs.

--> Hmm. I'm looking at the prototype for comm_query in topo.h and it doesn't take a list of processes. This seems like a bad idea; a component may only be able to run on a subset of processes in the overall MPI job (e.g., if you have a shared-memory topology component, it would only allow itself to be used at run-time if all processes in the communicator are physically located on the same node). Hmm. We might want to update this prototype to include a list of processes that you can check to see if your component is eligible. Additionally, it seems weird that the comm_unquery function is on the component -- it really should be on the module (editor's note: this framework was created way back during the beginning of OMPI and likely hasn't been touched since... I think it's showing its age :-\ ).

Once a module is selected, its function pointers effectively become the back-ends to functions like MPI_CART_CREATE, MPI_GRAPH_CREATE, etc. Note that you can implement all the topology functions in terms of MPI_CART_MAP and MPI_GRAPH_MAP (this is what unity does). If you provide NULL for all the other function pointers, the base will automatically insert functions that implement themselves by calling your module's cart_map and graph_map functions.
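
The "leave it NULL and the base fills it in" idea can be sketched like this. Again, all names and signatures are invented for illustration and are much simpler than the real module struct; the default create function here just delegates to the module's map function.

```c
#include <assert.h>
#include <stddef.h>

struct fb_module;
typedef int (*fb_map_fn)(const struct fb_module *m, int nprocs, int old_rank);
typedef int (*fb_create_fn)(const struct fb_module *m, int nprocs,
                            int old_rank, int *new_rank);

/* Hypothetical module: cart_map is mandatory, cart_create may be NULL. */
struct fb_module {
    fb_map_fn cart_map;
    fb_create_fn cart_create;
};

/* Base-provided default: implemented purely in terms of cart_map. */
int fb_base_cart_create(const struct fb_module *m, int nprocs,
                        int old_rank, int *new_rank)
{
    *new_rank = m->cart_map(m, nprocs, old_rank);
    return 0;
}

/* What the base would do after selecting a module: plug the holes. */
void fb_base_fill_defaults(struct fb_module *m)
{
    assert(m->cart_map != NULL);
    if (m->cart_create == NULL) {
        m->cart_create = fb_base_cart_create;
    }
}

/* Toy map function: reverses the rank order. */
int fb_reverse_map(const struct fb_module *m, int nprocs, int old_rank)
{
    (void)m;
    return nprocs - 1 - old_rank;
}
```

A module initialized as { fb_reverse_map, NULL } ends up with cart_create pointing at the base default after fb_base_fill_defaults(), so the component author only had to write the map function.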

Note that in order to save some space, we overlap the meanings of some fields (graph dimensions or list of indexes). In hindsight, I'm not sure why we didn't use a union. :-\
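
For what it's worth, a union version of that overlap might look something like the sketch below. The field and type names are hypothetical and do not match the actual OMPI structs; it only illustrates tagging the shared storage so each accessor knows which meaning is valid.

```c
#include <assert.h>
#include <stddef.h>

enum tp_kind { TP_CART, TP_GRAPH };

/* Hypothetical topology info: one pointer's worth of storage, two meanings. */
struct tp_info {
    enum tp_kind kind;
    int count;                 /* number of dims (cart) or nodes (graph) */
    union {
        const int *dims;       /* valid when kind == TP_CART */
        const int *index;      /* valid when kind == TP_GRAPH */
    } u;
};

/* Returns the cart dimensions, or NULL if this is not a cart topology. */
const int *tp_cart_dims(const struct tp_info *t)
{
    assert(t != NULL);
    return t->kind == TP_CART ? t->u.dims : NULL;
}

/* Returns the graph index array, or NULL if this is not a graph topology. */
const int *tp_graph_index(const struct tp_info *t)
{
    assert(t != NULL);
    return t->kind == TP_GRAPH ? t->u.index : NULL;
}
```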

Finally, when the communicator is destroyed, the module_finalize function is invoked.

=====

Based on my "Hmm..." comment above, I think I want to revamp the selection logic a little before you dive too deeply into this -- to modernize it and make it a bit more like the rest of the OMPI code base; you can tell that this code was created a long time ago and hasn't been touched since (you're the first person to express interest in creating a real topo component! :-) ). I've created a Mercurial branch of the OMPI trunk for this work and published it here:

    http://bitbucket.org/jsquyres/ompi-topo-fixes/

Give me a few days to get this branch into shape (and potentially to get it back to the SVN trunk). I might even get inspired to make a template 2nd component for you (i.e., I might need a 2nd component just to ensure that the selection logic is working :-) ).

--
Jeff Squyres
jsquy...@cisco.com
