WHAT: Modify the errmgr framework's API and have the odls call it to update 
process state

WHY: Remove the current loop in mpirun whereby mpirun "messages" itself with 
updated state on local processes

WHERE: Make the errmgr "proc_aborted" API more general, and thus change the few 
places it is called from within orte

TIMEOUT: Tues April 6 telecon

======================================================================
Currently, the odls base functions update mpirun's knowledge regarding the 
state of local procs by one of two mechanisms:

1. if we are a daemon, we send the update info to the HNP via the rml. The HNP 
receives this in plm_base_receive, which then updates the relevant data objects 
and calls a function to check for job termination. If the proc terminated 
abnormally, that function calls the errmgr to report it.

2. if we are the HNP, we post a "message event" that loops back to a different 
function that updates the proc state and calls the check job termination 
function.

The proposed change would clean up the "update proc state" procedure by removing 
the "message event" callback in the HNP. This would involve:

1. change the errmgr.proc_aborted API to be errmgr.update_proc_state (or pick 
your name) and add the proc state to the parameter list.

2. modify the odls_base_default_fns so that, whenever a proc changes its state, 
they call errmgr.update_proc_state instead of automatically sending a message 
back to the HNP (or, if we already are the HNP, pushing a message event). This 
would allow us to remove the message event loop from the HNP, which will help 
with thread issues. It also allows the daemons to do something other than just 
notify the HNP, which is a motivator for me/Cisco.

3. modify plm_base_receive so that messages to update proc state are passed 
directly to the errmgr.update_proc_state API.

4. create an orted module for the errmgr that sends an "update proc state" 
message back to the HNP.

5. create an HNP module for the errmgr that locally updates the proc state, 
checks for job termination, etc. The code in the plm base function for that 
purpose (check_job_complete) would be moved here.
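
To make the shape of the change concrete, here is a rough sketch of steps 1, 2, 
4 and 5. This is not actual ORTE code: every type, field and function name 
below (proc_t, job_t, errmgr_module_t, odls_report_state, sketch_msgs_to_hnp, 
etc.) is a hypothetical stand-in, the rml send is replaced by a counter, and 
the job-completion check is reduced to counting terminated procs. It only 
illustrates the idea of one update_proc_state entry point with per-role 
modules behind it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the real ORTE objects. */
typedef enum {
    PROC_STATE_RUNNING,
    PROC_STATE_TERMINATED,
    PROC_STATE_ABORTED
} proc_state_t;

typedef struct {
    int rank;
    proc_state_t state;
    int exit_code;
} proc_t;

typedef struct {
    int num_procs;       /* procs in the job */
    int num_terminated;  /* procs that have left the RUNNING state */
    bool aborted;        /* at least one proc aborted */
    bool complete;       /* all procs have terminated */
} job_t;

/* Step 1: a general entry point that carries the new state (assumed signature). */
typedef int (*update_proc_state_fn_t)(job_t *job, proc_t *proc,
                                      proc_state_t state, int exit_code);

typedef struct {
    const char *name;
    update_proc_state_fn_t update_proc_state;
} errmgr_module_t;

/* Stand-in for "send an update message to the HNP via the rml". */
int sketch_msgs_to_hnp = 0;

/* Step 4: the orted module records the state locally and notifies the HNP. */
static int orted_update_proc_state(job_t *job, proc_t *proc,
                                   proc_state_t state, int exit_code)
{
    (void)job;  /* the daemon does not track job completion in this sketch */
    assert(proc != NULL);
    proc->state = state;
    proc->exit_code = exit_code;
    sketch_msgs_to_hnp++;
    return 0;
}

/* Step 5: the HNP module updates the proc and checks for job termination
 * directly, with no message event loop back into itself. */
static int hnp_update_proc_state(job_t *job, proc_t *proc,
                                 proc_state_t state, int exit_code)
{
    assert(job != NULL && proc != NULL);
    bool was_running = (proc->state == PROC_STATE_RUNNING);
    proc->state = state;
    proc->exit_code = exit_code;
    if (was_running && state != PROC_STATE_RUNNING) {
        job->num_terminated++;
    }
    if (state == PROC_STATE_ABORTED) {
        job->aborted = true;
    }
    /* Reduced stand-in for the check_job_complete logic. */
    if (job->num_terminated == job->num_procs) {
        job->complete = true;
    }
    return 0;
}

const errmgr_module_t errmgr_orted = { "orted", orted_update_proc_state };
const errmgr_module_t errmgr_hnp   = { "hnp",   hnp_update_proc_state };

/* Step 2: the odls makes the same call regardless of which module is active. */
int odls_report_state(const errmgr_module_t *errmgr, job_t *job, proc_t *proc,
                      proc_state_t state, int exit_code)
{
    return errmgr->update_proc_state(job, proc, state, exit_code);
}
```

In this sketch the odls caller never needs to know whether it is running in a 
daemon or in the HNP; that decision sits entirely in whichever errmgr module 
was selected, which is also what would let a daemon do something other than 
just notify the HNP.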

This shouldn't impact anyone EXCEPT for those out there with branches that 
might get a conflict due to the errmgr API change. Hopefully that doesn't hit 
too many people.

