WHAT: Modify the errmgr framework's API and have the odls call it to update process state
WHY: Remove the current loop in mpirun whereby mpirun "messages" itself with updated state on local processes
WHERE: Make the errmgr "proc_aborted" API more general, and thus change the few places it is called from within orte
TIMEOUT: Tues April 6 telecon

======================================================================

Currently, the odls base functions update mpirun's knowledge regarding the state of local procs by one of two mechanisms:

1. if we are a daemon, we send the update info to the HNP via the rml. The HNP receives this in plm_base_receive, which then updates the relevant data objects and calls a function to check for job termination. If the proc abnormally terminated, then that function calls the errmgr to report it.

2. if we are the HNP, we post a "message event" that loops back to a different function that updates the proc state and calls the check job termination function.

The proposed change would clean up the "update proc state" procedure by removing the "message event" callback in the HNP. This would involve:

1. change the errmgr.proc_aborted API to be errmgr.update_proc_state (or pick your name) and add the proc state to the parameter list.

2. modify the odls_base_default_fns so that, whenever a proc changes its state, they call errmgr.update_proc_state instead of automatically sending a message back to the HNP (or pushing a message event if we already are the HNP). This would allow us to remove the message event loop from the HNP, which will help with thread issues. It also allows the daemons to do something other than just notify the HNP, which is a motivator for me/Cisco.

3. modify plm_base_receive so that messages to update proc state are referred directly to the errmgr.update_proc_state API

4. create an orted module for the errmgr that sends an "update proc state" message back to the HNP

5. create an HNP module for the errmgr that locally updates the proc state, checks for job termination, etc. The code in the plm base function for that purpose (check_job_complete) would be moved here.

This shouldn't impact anyone EXCEPT for those out there with branches that might get a conflict due to the errmgr API change. Hopefully that doesn't hit too many people.
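
To make the proposed flow a bit more concrete, here is a rough standalone sketch in C of how the pieces could fit together. It is only an illustration: every type, function, and variable name below (proc_state_t, errmgr_module_t, hnp_module, orted_module, odls_proc_state_changed, and so on) is a hypothetical stand-in rather than an actual ORTE definition, and the rml send from a daemon to the HNP is simulated with a direct function call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; NOT the real ORTE types or values. */
typedef enum {
    PROC_STATE_RUNNING = 0,
    PROC_STATE_TERMINATED,
    PROC_STATE_ABORTED
} proc_state_t;

typedef struct {
    int job;
    int rank;
} proc_name_t;

/* Sketch of the generalized errmgr entry point that would replace
 * proc_aborted: the new proc state is part of the parameter list. */
typedef struct {
    const char *name;
    int (*update_proc_state)(const proc_name_t *proc, proc_state_t state);
} errmgr_module_t;

#define NUM_PROCS 2

/* Simulated HNP bookkeeping (assumption: one job with NUM_PROCS procs). */
static proc_state_t hnp_proc_table[NUM_PROCS];
static bool job_complete = false;
static bool job_aborted = false;
static int messages_sent_to_hnp = 0;

/* Stand-in for the check_job_complete logic that would move into the
 * HNP errmgr module. */
static void check_job_complete(void)
{
    bool any_aborted = false;
    for (int i = 0; i < NUM_PROCS; i++) {
        if (hnp_proc_table[i] == PROC_STATE_RUNNING) {
            return; /* at least one proc is still alive */
        }
        if (hnp_proc_table[i] == PROC_STATE_ABORTED) {
            any_aborted = true;
        }
    }
    job_complete = true;
    job_aborted = any_aborted;
}

/* HNP module: update the proc state locally, then check for termination. */
static int hnp_update_proc_state(const proc_name_t *proc, proc_state_t state)
{
    if (proc->rank < 0 || proc->rank >= NUM_PROCS) {
        return -1;
    }
    hnp_proc_table[proc->rank] = state;
    check_job_complete();
    return 0;
}

errmgr_module_t hnp_module = { "hnp", hnp_update_proc_state };

/* Stand-in for plm_base_receive on the HNP: an incoming "update proc
 * state" message is referred directly to the HNP errmgr module. */
static int plm_base_receive_sim(const proc_name_t *proc, proc_state_t state)
{
    return hnp_module.update_proc_state(proc, state);
}

/* orted module: send the update to the HNP. The rml send is simulated
 * here by counting the message and delivering it synchronously. */
static int orted_update_proc_state(const proc_name_t *proc, proc_state_t state)
{
    messages_sent_to_hnp++;
    return plm_base_receive_sim(proc, state);
}

errmgr_module_t orted_module = { "orted", orted_update_proc_state };

/* Stand-in for the odls base default function: it only calls the active
 * errmgr module and no longer decides between a message and a message
 * event itself. */
int odls_proc_state_changed(const errmgr_module_t *errmgr,
                            const proc_name_t *proc, proc_state_t state)
{
    assert(errmgr != NULL && proc != NULL);
    return errmgr->update_proc_state(proc, state);
}
```

In this sketch the odls code path is identical on the HNP and on a daemon; only the selected errmgr module differs. Calling odls_proc_state_changed with hnp_module updates the table directly with no message involved, while calling it with orted_module goes through the simulated send and ends up in the same HNP module via the plm receive stand-in.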