Re: [OMPI devel] RFC: Resilient ORTE

Wesley Bland Wed, 8 Jun 2011 17:38:30 -0400

On Tuesday, June 7, 2011 at 4:55 PM, Josh Hursey wrote:

- orte_errmgr.post_startup() starts the persistent RML message. There
does not seem to be a shutdown version of this (to deregister the RML
message at orte_finalize time). Was this intentional, or just missed?


I just missed that one. I've added it to the code now.

- in the orte_errmgr.set_fault_callback: it would be nice if it
returned the previous callback, so you could layer more than one
'thing' on top of ORTE and have them chain in a sigaction-like manner.

Again, you are correct. Rather than just returning the previous callback
(if any), I think it makes more sense to maintain a list of callbacks and
have the errmgr call each of them directly. That way applications and OMPI
layers don't have to worry about calling another callback function.
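
A minimal sketch of the callback-list idea described above, not the actual
ORTE code. Every name here (fault_cb_t, register_fault_callback,
notify_fault, record_fault, MAX_FAULT_CALLBACKS) is hypothetical, and the
fixed-size array is an assumption made to keep the example short:

```c
#include <stddef.h>

#define MAX_FAULT_CALLBACKS 8   /* hypothetical limit, for illustration only */

/* Hypothetical callback signature: rank of the failed process plus user data. */
typedef void (*fault_cb_t)(int failed_rank, void *ctx);

typedef struct {
    fault_cb_t fn;
    void *ctx;
} fault_cb_entry_t;

static fault_cb_entry_t fault_callbacks[MAX_FAULT_CALLBACKS];
static size_t num_fault_callbacks = 0;

/* Append a callback to the list. Returns 0 on success, -1 if fn is NULL
 * or the list is full. */
static int register_fault_callback(fault_cb_t fn, void *ctx)
{
    if (fn == NULL || num_fault_callbacks == MAX_FAULT_CALLBACKS) {
        return -1;
    }
    fault_callbacks[num_fault_callbacks].fn = fn;
    fault_callbacks[num_fault_callbacks].ctx = ctx;
    num_fault_callbacks++;
    return 0;
}

/* Invoke every registered callback in registration order, so no layer
 * has to chain to the previous one itself. Returns the number called. */
static size_t notify_fault(int failed_rank)
{
    for (size_t i = 0; i < num_fault_callbacks; i++) {
        fault_callbacks[i].fn(failed_rank, fault_callbacks[i].ctx);
    }
    return num_fault_callbacks;
}

/* Example callback: store the failed rank in the int that ctx points to. */
static void record_fault(int failed_rank, void *ctx)
{
    *(int *)ctx = failed_rank;
}
```

With this shape, two independent layers can each register record_fault with
their own context, and a single notify_fault call reaches both. The
sigaction-style alternative would instead require each layer to save and
call the previous handler.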

- orte_process_info.max_procs: this seems to be only used in the
binomial routed, but I was a bit unclear about its purpose. Can you
describe what it does, and how it is used?

I use this to determine how many processes were in the job before we started
having failures. This helps me preserve the structure of the tree as much as
possible rather than completely reorganizing the routing layer every time a
process fails.
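
To illustrate why the original job size matters, here is a rough sketch
that assumes a textbook binomial tree rooted at rank 0. This is not
necessarily the layout the binomial routed component uses. The helpers
binomial_parent and live_parent and the alive array are hypothetical. Each
rank's position is derived from its own rank alone, the bounds check uses
max_procs rather than the current live count, and a failure only reattaches
the orphaned subtree to the nearest live ancestor:

```c
#include <stdbool.h>

/* Parent of `rank` in a binomial tree rooted at 0: clear the highest
 * set bit. Returns -1 for the root (or a negative rank). */
static int binomial_parent(int rank)
{
    if (rank <= 0) {
        return -1;
    }
    int high = 1;
    while ((high << 1) <= rank) {
        high <<= 1;
    }
    return rank - high;
}

/* Nearest live ancestor of `rank`. `alive` has max_procs entries, where
 * max_procs is the job size before any failure. Rank 0 is returned
 * without checking alive[0], i.e. the root is assumed to survive.
 * Returns -1 for the root or an out-of-range rank. */
static int live_parent(int rank, const bool *alive, int max_procs)
{
    if (rank <= 0 || rank >= max_procs) {
        return -1;
    }
    int p = binomial_parent(rank);
    while (p > 0 && !alive[p]) {
        p = binomial_parent(p);   /* skip over failed ancestors */
    }
    return p;
}
```

In an 8-process job where rank 2 has failed, rank 6 (whose original parent
is 2) reattaches to rank 0, while rank 7 keeps its parent 3. The rest of the
tree is untouched.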

- in orted_comm.c: you process the ORTE_PROCESS_FAILED_NOTIFICATION
message here. Why not push all of that logic into the errmgr
components? It is not a big deal, just curious.

Most of the actual logic that handles the processing of the error messages
is pushed into the errmgr component. The code you see in orted_comm.c is
almost entirely parsing the list of dead processes and resending it to the
appropriate modules. That code will have to be in there no matter what.
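
The parse-and-forward step described above might look roughly like the
sketch below. The wire format (a 32-bit count followed by that many 32-bit
process ids in host byte order) and all names (dead_proc_sink_t,
dispatch_dead_procs, count_dead) are assumptions made for illustration.
They are not the actual ORTE message layout or API:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical hook through which a module (errmgr, routed, ...) is told
 * about one dead process. */
typedef void (*dead_proc_sink_t)(uint32_t vpid, void *ctx);

/* Decode an assumed payload of [count][vpid 0]...[vpid count-1] and hand
 * every vpid to every sink. Returns the number of dead processes, or -1
 * if the buffer is too short for the count it announces. */
static int dispatch_dead_procs(const uint8_t *buf, size_t len,
                               const dead_proc_sink_t *sinks,
                               void **ctxs, size_t nsinks)
{
    uint32_t count;
    if (len < sizeof count) {
        return -1;
    }
    memcpy(&count, buf, sizeof count);
    if ((len - sizeof count) / sizeof(uint32_t) < count) {
        return -1;
    }
    for (uint32_t i = 0; i < count; i++) {
        uint32_t vpid;
        memcpy(&vpid, buf + sizeof count + (size_t)i * sizeof vpid,
               sizeof vpid);
        for (size_t s = 0; s < nsinks; s++) {
            sinks[s](vpid, ctxs[s]);
        }
    }
    return (int)count;
}

/* Example sink: count how many dead processes were reported. */
static void count_dead(uint32_t vpid, void *ctx)
{
    (void)vpid;
    ++*(int *)ctx;
}
```

The dispatcher only unpacks and fans out. Any decision about what a failure
means stays inside the sinks, which matches the split between orted_comm.c
and the errmgr component described above.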

I've updated the code and checked it into a Bitbucket repository, which can
be found here:

https://bitbucket.org/wesbland/resilient-orte/

Please let me know if you have any more comments,
Wesley
