The ft_event() function that you mentioned is part of the larger fault tolerance infrastructure in Open MPI. You need to enable that infrastructure before using it (if it is not enabled, many of the ft_event functions default to NULL). Add '--with-ft=cr' to your ./configure line to enable the FT infrastructure.

As Jeff mentioned, you might be able to use the Checkpoint/Restart Coordination Protocol (CRCP) framework [located in ompi/mca/crcp] to halt messaging. It works as a wrapper around the PML, so you operate on whole MPI messages rather than on fragments as in the BTLs below it. It might be another option to consider.

-- Josh

On Jan 11, 2010, at 5:08 PM, Jeff Squyres wrote:

Additionally, I believe that the FT system already does something like what you describe (although perhaps not exactly the same thing) -- there is a phase where the FT system pauses and quiesces all BTLs.

Did you look at that part of the code, perchance, and see if it meets your needs?


On Jan 11, 2010, at 3:53 PM, Christoph Konersmann wrote:

Thanks a lot for your help! I will give it a try.

Christoph

Ralph Castain wrote:
You've got this a tad wrong, but that's okay - let me try to clarify a couple of things that may help.

First, you don't want to add this as a separate orted command. As you noted, orte has no direct way to tell the OMPI layer to do anything. Instead, you want to pass a message to the process that is received in the OMPI layer. That is easy to do.

1. add a message tag in ompi/mca/dpm/dpm.h - perhaps something like OMPI_RML_TAG_BTL_CTL

2. in the btl, add a call to orte_rml.recv_nb() that identifies the above tag and specifies a callback function to use when such a message arrives

3. in that callback function, toggle your "paused" flag - or you can unpack the buffer to get a flag telling you what value to set. Your choice.

Now, when you want to pause the BTL, you do an orte_grpcomm.xcast() to the above message tag. ORTE will deliver that message to every process, which will then have its callback function called.
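For illustration only, here is a rough, self-contained sketch of the flow described above: register a callback for a control tag, have the callback set a "paused" flag from the message payload, and broadcast the control message to every process. Everything named sketch_* or SKETCH_* is a hypothetical stand-in that simulates the pattern inside a single program; none of it is the real orte_rml or orte_grpcomm interface, whose actual signatures should be taken from the ORTE headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the ORTE/OMPI APIs. */
#define SKETCH_TAG_BTL_CTL 42  /* plays the role of OMPI_RML_TAG_BTL_CTL */
#define SKETCH_MAX_TAGS    64
#define SKETCH_NUM_PROCS   4   /* simulated processes in this one program */

typedef void (*sketch_recv_cb_t)(int proc, int tag,
                                 const void *buf, size_t len);

/* Per simulated process: one callback slot per tag, one paused flag. */
static sketch_recv_cb_t sketch_callbacks[SKETCH_NUM_PROCS][SKETCH_MAX_TAGS];
static bool sketch_paused[SKETCH_NUM_PROCS];

/* Step 2 analogue: register a callback for messages with a given tag. */
int sketch_recv_nb(int proc, int tag, sketch_recv_cb_t cb)
{
    if (proc < 0 || proc >= SKETCH_NUM_PROCS ||
        tag < 0 || tag >= SKETCH_MAX_TAGS) {
        return -1;
    }
    sketch_callbacks[proc][tag] = cb;
    return 0;
}

/* Step 3 analogue: unpack a one-byte flag if present, else toggle. */
void sketch_btl_ctl_cb(int proc, int tag, const void *buf, size_t len)
{
    (void)tag;
    if (buf != NULL && len >= 1) {
        sketch_paused[proc] = (((const unsigned char *)buf)[0] != 0);
    } else {
        sketch_paused[proc] = !sketch_paused[proc];
    }
}

/* xcast analogue: deliver the message to every process that registered
 * a callback for the tag. Returns the number of callbacks invoked. */
int sketch_xcast(int tag, const void *buf, size_t len)
{
    int delivered = 0;
    if (tag < 0 || tag >= SKETCH_MAX_TAGS) {
        return -1;
    }
    for (int p = 0; p < SKETCH_NUM_PROCS; p++) {
        sketch_recv_cb_t cb = sketch_callbacks[p][tag];
        if (cb != NULL) {
            cb(p, tag, buf, len);
            delivered++;
        }
    }
    return delivered;
}

/* What a BTL send path would check before pushing data out. */
bool sketch_btl_is_paused(int proc)
{
    return sketch_paused[proc];
}

/* Usage: register everywhere, pause all, then resume all. */
int sketch_demo(void)
{
    const unsigned char pause = 1, resume = 0;

    for (int p = 0; p < SKETCH_NUM_PROCS; p++) {
        assert(sketch_recv_nb(p, SKETCH_TAG_BTL_CTL, sketch_btl_ctl_cb) == 0);
    }
    assert(sketch_xcast(SKETCH_TAG_BTL_CTL, &pause, 1) == SKETCH_NUM_PROCS);
    for (int p = 0; p < SKETCH_NUM_PROCS; p++) {
        assert(sketch_btl_is_paused(p));
    }
    assert(sketch_xcast(SKETCH_TAG_BTL_CTL, &resume, 1) == SKETCH_NUM_PROCS);
    for (int p = 0; p < SKETCH_NUM_PROCS; p++) {
        assert(!sketch_btl_is_paused(p));
    }
    return 0;
}
```

Carrying the flag value in the payload, rather than blindly toggling, keeps every process in the same state even if one of them misses or repeats a message; the toggle branch is kept only to mirror the simpler option mentioned in step 3.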

HTH
Ralph

--
Paderborn Center for Parallel Computing - PC2
University of Paderborn - Germany
http://www.pc2.de

Christoph Konersmann <c...@upb.de>
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Jeff Squyres
jsquy...@cisco.com


