On Tuesday, June 7, 2011 at 4:55 PM, Josh Hursey wrote:

> - orte_errmgr.post_startup() starts the persistent RML message. There does not seem to be a shutdown version of this (to deregister the RML message at orte_finalize time). Was this intentional, or just missed?

I just missed that one. I've added that into the code now.

> - in the orte_errmgr.set_fault_callback: it would be nice if it returned the previous callback, so you could layer more than one 'thing' on top of ORTE and have them chain in a sigaction-like manner.

Again, you are correct. Rather than just returning the previous callback (if any), I think it makes more sense to maintain a list of callbacks and have the errmgr call them directly. That way applications/ompi layers don't have to worry about calling another callback function.

> - orte_process_info.max_procs: this seems to be only used in the binomial routed, but I was a bit unclear about its purpose. Can you describe what it does, and how it is used?

I use this to determine how many processes were in the job before we started having failures. This helps me preserve the structure of the tree as much as possible rather than completely reorganizing the routing layer every time a process fails.

> - in orted_comm.c: you process the ORTE_PROCESS_FAILED_NOTIFICATION message here. Why not push all of that logic into the errmgr components? It is not a big deal, just curious.

Most of the actual logic that handles the processing of the error messages is pushed into the errmgr component. The code you see in orted_comm.c is almost all parsing and resending the list of dead processes to the appropriate modules. That code will have to be in there no matter what.

I've updated the code and checked it into a bitbucket repository which can be found here:

https://bitbucket.org/wesbland/resilient-orte/

Please let me know of any more comments,
Wesley
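
For illustration only, the callback-list idea discussed above (the errmgr keeps a list of fault callbacks and calls each one itself, so layers do not have to chain to one another) might be sketched roughly as follows. Everything here is hypothetical: the names fault_cb_t, register_fault_callback, and notify_fault, the callback signature, and the fixed-size array are assumptions made for brevity, and none of this is the actual ORTE errmgr API or the code in the repository.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback signature: receives the rank of the failed process. */
typedef void (*fault_cb_t)(int failed_rank);

/* Assumed fixed capacity, chosen only to keep the sketch short. */
#define MAX_FAULT_CALLBACKS 8

static fault_cb_t fault_callbacks[MAX_FAULT_CALLBACKS];
static size_t num_fault_callbacks = 0;

/* Append a callback to the list.
 * Returns 0 on success, -1 if cb is NULL or the list is full. */
int register_fault_callback(fault_cb_t cb)
{
    if (cb == NULL || num_fault_callbacks == MAX_FAULT_CALLBACKS) {
        return -1;
    }
    fault_callbacks[num_fault_callbacks++] = cb;
    return 0;
}

/* Call every registered callback in registration order.
 * Returns the number of callbacks that were invoked. */
size_t notify_fault(int failed_rank)
{
    for (size_t i = 0; i < num_fault_callbacks; i++) {
        assert(fault_callbacks[i] != NULL);
        fault_callbacks[i](failed_rank);
    }
    return num_fault_callbacks;
}

/* Two example layers (hypothetical) that each record the last failure seen. */
int app_seen = -1;
int ompi_seen = -1;

void app_layer_cb(int failed_rank)  { app_seen = failed_rank; }
void ompi_layer_cb(int failed_rank) { ompi_seen = failed_rank; }
```

With this shape, each layer registers once and never needs to know about the others, whereas a sigaction-style design that returns the previous handler leaves every layer responsible for saving and invoking the callback it replaced.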