On the call today, no one had any objections to bringing this stuff to
the trunk. v1.2.9 and v1.3.0 releases have a higher priority, so I'll
bring this stuff over to the trunk when those two releases are done
(hopefully tomorrow!).
On Jan 10, 2009, at 2:21 PM, Jeff Squyres wrote:
FWIW, I've finished a first cut of this stuff. I'll provide an
overview on next Tuesday's teleconf.
I didn't "fix" MPI_REPLACE yet (it does seem to be a different
issue; I mainly extended what was already there) but I've done most
of the rest of the work:
- Created a new op framework that was inspired by the coll framework.
- Similar to the "coll" framework, the op framework supports:
- Mixing-n-matching op modules on a single MPI_Op
- "Stacking" op modules (e.g., choose at invocation time whether
a module will use its back-end hardware, or whether it should fall
back to a different module's implementation)
- Unlike the coll framework, there is no "basic" component or set of
modules: all the "basic" functions live in the op base and are
pre-loaded onto the MPI_Op during selection at the 0th priority, so
you can stack on top of them naturally (the base functions even have
a [bogus] module, so you can RETAIN them just like any other module).
A rough sketch of how this stacking fits together follows this list.
- Created an "example" op component that has a few sample routines
and shows a bunch of different OMPI concepts, both in the op
framework and utilizing other parts of the OMPI code base (hopefully
helpful to newbie OMPI component authors).
==> NOTE: The example op is currently fairly chatty with
opal_output() so that you can see that it is being used.
I'll .ompi_ignore it (or something) when it is brought into the
trunk so that the example component isn't active in production runs.
- Created wiki pages describing autogen, how to create a framework,
and how to create a component (hopefully helpful to newbie OMPI
component authors).
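To make the stacking idea above a bit more concrete, here's a rough
sketch of what a stacked module might look like. Note that
example_op_module_t, op_fn_t, and the reduce functions below are
made-up names for this email -- they are not the framework's actual
types or signatures.

    /* Hypothetical sketch of module stacking; names and signatures
     * are illustrative only, not the real op framework interface. */
    typedef struct example_op_module example_op_module_t;

    typedef void (*op_fn_t)(const void *in, void *inout, int count,
                            example_op_module_t *module);

    /* A selected module carries its own state plus the function and
     * module it should fall back to.  The 0th-priority "base"
     * functions are pre-loaded onto the MPI_Op during selection, so
     * there is always something at the bottom of the stack to punt
     * to. */
    struct example_op_module {
        int                  hw_usable;       /* back-end available? */
        op_fn_t              fallback_fn;     /* next module down    */
        example_op_module_t *fallback_module; /* its module context  */
    };

    /* Stand-in for the accelerated path; a real component would hand
     * this loop to its back-end hardware. */
    static void example_hw_sum_double(const void *in, void *inout,
                                      int count)
    {
        const double *src = (const double *) in;
        double *dst = (double *) inout;
        for (int i = 0; i < count; ++i) {
            dst[i] += src[i];
        }
    }

    /* Invocation-time choice: use the back-end hardware if we can,
     * otherwise punt to whatever module sits below us in the stack. */
    static void example_reduce_sum_double(const void *in, void *inout,
                                          int count,
                                          example_op_module_t *module)
    {
        if (module->hw_usable) {
            example_hw_sum_double(in, inout, count);
        } else {
            module->fallback_fn(in, inout, count,
                                module->fallback_module);
        }
    }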
=======================
I think that the second phase of this work will be the various
hardware providers contributing their own components to Open MPI
(e.g., CUDA, OpenCL, IBM Cell, etc.).
If this all proves worthwhile, I think a third phase of this work
could be optimizing the top-level reduction calls based on which
nodes have hardware acceleration and which do not (e.g., if
accelerators are not available on all nodes, that may change the
collective/reduction communication pattern).
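As a purely hypothetical illustration of that third phase, the
top-level decision might look something like the following --
have_local_accelerator and the two pattern constants are made up for
this email; only the MPI calls are real:

    /* Hypothetical: learn which ranks have an accelerator, then pick
     * a communication pattern for the reduction accordingly. */
    #include <mpi.h>
    #include <stdlib.h>

    enum { PATTERN_UNIFORM_TREE, PATTERN_FUNNEL_TO_ACCEL };

    static int choose_reduce_pattern(MPI_Comm comm,
                                     int have_local_accelerator)
    {
        int i, size, all_accel = 1;
        int *flags;

        MPI_Comm_size(comm, &size);
        flags = (int *) malloc(size * sizeof(int));

        /* Everyone contributes a yes/no flag. */
        MPI_Allgather(&have_local_accelerator, 1, MPI_INT,
                      flags, 1, MPI_INT, comm);
        for (i = 0; i < size; ++i) {
            if (!flags[i]) {
                all_accel = 0;
                break;
            }
        }
        free(flags);

        /* If every node can accelerate, a normal reduction tree is
         * fine; otherwise it may pay to funnel data toward the
         * accelerated nodes first. */
        return all_accel ? PATTERN_UNIFORM_TREE
                         : PATTERN_FUNNEL_TO_ACCEL;
    }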
On Jan 5, 2009, at 10:21 AM, Jeff Squyres wrote:
On Jan 5, 2009, at 10:09 AM, Brian W. Barrett wrote:
I think this sounds reasonable, if (and only if) MPI_Accumulate is
properly handled. The interface for calling the op functions was
broken in some fairly obvious way for accumulate when I was
writing the one-sided code. I think I had to call some supposedly
internal bits of the interface to make accumulate work. I can't
remember what they are now, but I do remember it being a problem.
Coolio; I'll look into it.
Of course, unless it makes mpi_allreduce on one double-sized
floating point number using sum go faster, I'm not entirely sure a
change is helpful ;).
From my (admittedly limited) understanding, since there are memory
registration and/or copy in/out issues with GPUs, the operation has
to be "big enough" and/or already located in GPU memory for the GPU
to outperform the CPU. It is my assumption that the componentized
CUDA/OpenCL/whatever code will need to make a run-time decision
whether it should perform the operation itself or pass it back to a
fallback [probably CPU-based] implementation, analogous to how
"tuned" picks the right coll algorithm.
I'm told that there's some researchy middleware working on exactly
this kind of problem (determining if a given operation is suitable
to run on the GPU or the main CPU). So in a best-case scenario,
OMPI can just link against and use that middleware rather than
implementing all the logic in the component itself. We'll see how
it plays out.
My goal is to give these guys the infrastructure that they need in
OMPI to play with these kinds of concepts and see what they can
accomplish in terms of real performance. FWIW: a few SC08 attendees
thought that they could avoid writing much CUDA/CL/whatever code if
MPI_REDUCE did the work for them (particularly if paired with the
proposed MPI_REDUCE_LOCAL function,
https://svn.mpi-forum.org/trac/mpi-forum-web/ticket/24). [shrug]
We'll see!
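For reference, a call to the proposed function would look roughly
like this (a sketch of the proposed interface as I understand it from
the ticket; the final signature could of course change):

    /* Sketch: let MPI (and thus any hardware-accelerated op
     * component) do the element-wise combine locally via the
     * proposed MPI_REDUCE_LOCAL. */
    #include <mpi.h>

    void local_sum(double *partial, double *accum, int count)
    {
        /* accum[i] += partial[i] for all i, performed by whatever op
         * module MPI selects under the covers. */
        MPI_Reduce_local(partial, accum, count, MPI_DOUBLE, MPI_SUM);
    }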
--
Jeff Squyres
Cisco Systems