Hi folks,

There is a PR that has cleared Jenkins, but it represents a fairly significant change in OMPI capabilities. Thus, I think it merits a little more attention.

The PR (https://github.com/open-mpi/ompi/pull/1767) brings the PMIx event notification system into OMPI. Quoting from the PMIx RFC:

===============================
The PMIx Event Notification system provides a mechanism by which the resource manager can communicate system events to applications, thus providing applications with an opportunity to generate an appropriate response. In addition, applications can use the system to request that the resource manager notify their peers of internal events (e.g., computational errors and aborted operations), and notify the resource manager of events detected by the application.

The resource manager will be aware of a wide range of events that occur across the system. For the purposes of this discussion, only events that impact the allocated session being served by the PMIx server are considered. These events can be divided into two distinct classes:

* Job-specific events that directly relate to a job executing within the session. This might include events such as debugger attachment or process failure within a related job. These events are characterized by directly targeting processes within session jobs - i.e., the "procs" parameter of the notification contains members of a job executing within the session. Events in this category are to be immediately delivered to the PMIx server library for delivery to the specified processes. Clients can indicate a desire to register solely for job-specific events by including the _PMIX\_EVENT\_JOB\_LEVEL_ key in their call to _PMIx\_Register\_event_ - i.e., providing this key will explicitly indicate that environment events are _not_ to be reported to this callback function.

* Environment events that impact the session, but are not directly sent to executing jobs. This is a much broader category of events that includes ECC errors, temperature excursions, and other environmental events directly affecting the session's resources.
Note that although these do impact the session's jobs, they are not directly referencing those jobs - i.e., the event is generated without specifying a particular target. Thus, events in this category are to be delivered to the PMIx server library only upon request - i.e., when the PMIx server has registered for those events. Note that race conditions can cause the registration to come _after_ events of possible interest (e.g., a memory ECC event that occurs after start of execution but prior to registration). RMs are free to cache events in this category for some time to mitigate this situation, but are not required to do so. Thus, applications must be aware that environment events prior to registration may not be included in notifications. As above, clients can indicate a desire to register solely for environment events of a given type by including the _PMIX\_EVENT\_ENVIRO\_LEVEL_ key in their registration call.

The PMIx server will cache any environment events passed to it for a period of time to provide notification to clients that have not yet registered for them. Currently, the PMIx server uses a ring buffer to cache events. The size of the ring buffer defaults to 512 events (as of PMIx 2.0), but can be configured using the _PMIx\_server\_cache\_size_ info key during the call to the _PMIx\_Server\_init_ API.

Client application processes can also use the PMIx Event Notification system to request that the resource manager notify their peers of internal events, and notify the resource manager of events detected by the application process. Examples of the latter include network communication errors that may not have been detected by the fabric manager itself (e.g., data corruption). The client must direct the notification to the appropriate target (RM or peers) using the corresponding range parameter.
===============================

The biggest change for OMPI is that it enables you to register event handlers for specific error constants - e.g., for knowing when debugger release has been issued. What you do in response to that notification is totally up to you, and we do “chain” the handlers (and pass the output of one down to the following handlers).

This should not be considered a cast-in-concrete capability - it will evolve as folks start to use it. However, we believe the interfaces should now be stable and ready for use.

The changes include:

* upgrading the base PMIx installation to 2.0.0a1, tracking (but lagging) the PMIx master
* creating a PMIx 1.1.4-specific external component for backward compatibility
* adding a PMIx 2.x-specific external component for those wanting to build directly against the PMIx master
* converting debugger support to use PMIx instead of RML for release. Note that the OOB/usock component remains for show_help support until the upcoming PMIx_Log interface is available.

Please provide any comments or concerns. I’m planning to “hold” this PR a bit while we resolve the OMPI 2.0 issues.

Ralph
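
P.S. For anyone who wants a feel for the handler chaining mentioned above, here is a minimal sketch in Python. It does not use the PMIx or OMPI APIs: `HandlerChain`, `register`, `notify`, the two handlers, and the two event codes are all hypothetical names made up for illustration, and the real handler signature and chaining rules may well differ. It only shows the general idea of registering handlers for specific codes and passing each handler's output down to the next one.

```python
# Hypothetical sketch: none of these names come from PMIx or OMPI.
from typing import Callable, Iterable, List, Optional, Tuple

# A handler receives the event code plus the results produced by the
# handlers that ran before it, and returns the (possibly extended) results.
Handler = Callable[[int, List[str]], List[str]]

# Made-up event codes, for illustration only.
EVENT_DEBUGGER_RELEASE = 1
EVENT_PROC_ABORTED = 2


class HandlerChain:
    """Keeps handlers in registration order and runs the matching ones."""

    def __init__(self) -> None:
        # Each entry is (codes, handler); codes=None means "all events".
        self._entries: List[Tuple[Optional[frozenset], Handler]] = []

    def register(self, handler: Handler,
                 codes: Optional[Iterable[int]] = None) -> int:
        """Register a handler for specific codes; return a reference id."""
        wanted = frozenset(codes) if codes is not None else None
        self._entries.append((wanted, handler))
        return len(self._entries) - 1

    def notify(self, code: int) -> List[str]:
        """Run every matching handler, feeding each the prior results."""
        results: List[str] = []
        for wanted, handler in self._entries:
            if wanted is None or code in wanted:
                results = handler(code, results)
        return results


def on_release(code: int, prior: List[str]) -> List[str]:
    # Only registered for the debugger-release code in the usage below.
    return prior + ["released"]


def log_everything(code: int, prior: List[str]) -> List[str]:
    # Sees whatever the earlier handlers in the chain produced.
    return prior + [f"logged code {code} after {len(prior)} handler(s)"]


if __name__ == "__main__":
    chain = HandlerChain()
    chain.register(on_release, [EVENT_DEBUGGER_RELEASE])
    chain.register(log_everything)  # no codes given: handles all events
    print(chain.notify(EVENT_DEBUGGER_RELEASE))
    print(chain.notify(EVENT_PROC_ABORTED))
```

In this sketch a handler registered without codes acts as a catch-all, and handlers run in registration order; both are choices made for the example rather than statements about how the PR behaves.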