This RFC has been cancelled.

It proved impossible to implement cleanly given the current "composite" nature 
of the errmgr framework. The composite design leaves state indeterminate when 
control falls between errmgr modules once async messaging is implemented. The 
only solution we could find was a giant global "lock" across ORTE, which 
defeated the purpose of async operations.

Once Josh commits his other modules, we will work with him to review those 
modules and see how we can reorganize this framework to a more conventional 
mode of operation where the individual modules are independent.

In the meantime, we have moved the orcm module to the orcm code base and will 
"no-build" all other modules to avoid the problems. Unfortunately, this means 
we will also have to copy significant chunks of orte code over to orcm until 
this is resolved. Hopefully it won't be too long before the two code bases can 
be re-merged.


On Apr 1, 2010, at 10:35 AM, Ralph Castain wrote:

> WHAT: Modify the errmgr framework's API and have the odls call it to update 
> process state
> 
> WHY: Remove the current loop in mpirun whereby mpirun "messages" itself with 
> updated state on local processes
> 
> WHERE: Make the errmgr "proc_aborted" API more general, and thus change the 
> few places it is called from within orte
> 
> TIMEOUT: Tues April 6 telecon
> 
> ======================================================================
> Currently, the odls base functions update mpirun's knowledge regarding the 
> state of local procs by one of two mechanisms:
> 
> 1. if we are a daemon, we send the update info to the HNP via the rml. The 
> HNP receives this in plm_base_receive, which then updates the relevant data 
> objects and calls a function to check for job termination. If the proc 
> abnormally terminated, then that function calls the errmgr to report it.
> 
> 2. if we are the HNP, we post a "message event" that loops back to a 
> different function that updates the proc state and calls the check job 
> termination function.
> 
> The proposed change would clean up the "update proc state" procedure by 
> removing the "message event" callback in the HNP. This would involve:
> 
> 1. change the errmgr.proc_aborted API to be errmgr.update_proc_state (or pick 
> your name) and add the proc state to the parameter list.
> 
> 2. modify the odls_base_default_fns so that, whenever a proc changes its 
> state, they call errmgr.update_proc_state instead of automatically sending a 
> message back to the HNP (or pushing a message event if we already are the 
> HNP). This would allow us to remove the message event loop from the HNP, 
> which will help with thread issues. It also allows the daemons to do 
> something other than just notify the HNP, which is a motivator for me/Cisco.
> 
> 3. modify plm_base_receive so that messages to update proc state are referred 
> directly to the errmgr.update_proc_state API
> 
> 4. create an orted module for the errmgr that sends an "update proc state" 
> message back to the HNP
> 
> 5. create an HNP module for the errmgr that locally updates the proc state, 
> checks for job termination, etc. The code in the plm base function for that 
> purpose (check_job_complete) would be moved here.
> 
> This shouldn't impact anyone EXCEPT for those out there with branches that 
> might get a conflict due to the errmgr API change. Hopefully that doesn't hit 
> too many people.
> 
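For readers following the quoted proposal, steps 1 to 5 can be pictured with a minimal C sketch. It is an illustration only, not ORTE code: every type and function name below (proc_state_t, proc_name_t, errmgr_module_t, odls_report_state, and so on) is a hypothetical stand-in, and the daemon-to-HNP message is simulated with a direct function call rather than the rml. It shows one update_proc_state entry point whose behavior depends on which errmgr module is active, so the odls no longer has to distinguish the daemon case from the HNP case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; these are NOT the real ORTE types. */
typedef enum {
    PROC_STATE_RUNNING = 0,
    PROC_STATE_TERMINATED,
    PROC_STATE_ABORTED
} proc_state_t;

typedef struct {
    int jobid;
    int vpid;
} proc_name_t;

/* Step 1: a single entry point that takes the proc state as a parameter. */
typedef struct {
    const char *name;
    int (*update_proc_state)(const proc_name_t *proc, proc_state_t state);
} errmgr_module_t;

#define MAX_PROCS 4
static proc_state_t hnp_proc_states[MAX_PROCS]; /* zero-init: all RUNNING */
static int msgs_sent_to_hnp = 0;
static bool job_complete = false;

/* Step 5: the job-termination check lives with the HNP module. */
static void check_job_complete(void)
{
    for (int i = 0; i < MAX_PROCS; i++) {
        if (hnp_proc_states[i] == PROC_STATE_RUNNING) {
            job_complete = false;
            return;
        }
    }
    job_complete = true;
}

/* Step 5: the HNP module updates its local records directly. */
static int hnp_update_proc_state(const proc_name_t *proc, proc_state_t state)
{
    if (proc->vpid < 0 || proc->vpid >= MAX_PROCS) {
        return -1;
    }
    hnp_proc_states[proc->vpid] = state;
    check_job_complete();
    return 0;
}

/* Step 3: the receive path refers the update straight to the errmgr. */
static int plm_receive_sketch(const proc_name_t *proc, proc_state_t state)
{
    return hnp_update_proc_state(proc, state);
}

/* Step 4: the daemon module forwards the update to the HNP.
 * A real daemon would send a message; here delivery is a direct call. */
static int orted_update_proc_state(const proc_name_t *proc, proc_state_t state)
{
    msgs_sent_to_hnp++;
    return plm_receive_sketch(proc, state);
}

static const errmgr_module_t errmgr_hnp   = { "hnp",   hnp_update_proc_state };
static const errmgr_module_t errmgr_orted = { "orted", orted_update_proc_state };

/* Step 2: the odls calls the active module and never checks its own role. */
static int odls_report_state(const errmgr_module_t *errmgr,
                             const proc_name_t *proc, proc_state_t state)
{
    assert(errmgr != NULL && errmgr->update_proc_state != NULL);
    return errmgr->update_proc_state(proc, state);
}
```

In this sketch, calling odls_report_state with errmgr_hnp updates the table with no message traffic, while calling it with errmgr_orted counts one simulated message per update before the same HNP code runs.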

