Hi Ralph and Josh,

>>> Regarding to the schema represented by the picture, I didn't understand the
>>> RecoS' behaviour in a node failure situation.
>>>
>>> In this case, will mpirun consider the daemon failure as a normal proc
>>> failure? If it is correct, should mpirun update the global procs state for
>>> all jobs running under the failed daemon?
>>
>> I haven't included the node failure case yet - still on my "to-do" list. In
>> brief, the answer is yes/no. :-)
>>
>> Daemon failure follows the same code path as shown in the flow chart.
>> However, it is up to the individual modules to determine a response to that
>> failure. The "orcm" RecoS module response is to (a) mark all procs on that
>> node as having failed, (b) mark that node as "down" so it won't get reused,
>> and (c) remap and restart all such procs on the remaining available nodes,
>> starting new daemon(s) as required.
>>
>> In the orcm environment, nodes that are replaced or rebooted automatically
>> start their own daemon. This is detected by orcm, and the node state (if the
>> node is rebooted) will automatically be updated to "up" - if it is a new
>> node, it is automatically added to the available resources. This allows the
>> node to be reused once the problem has been corrected. In other environments
>> (ssh, slurm, etc), the node is simply left as "down" as there is no way to
>> know if/when the node becomes available again.
>>
>> If you aren't using the "orcm" module, then the default behavior will abort
>> the job.
>
> Just to echo this response. The orted and process failures use the same error
> path, but can be easily differentiated by their jobids. The 'orcm' component
> is a good example of differentiating these two fault scenarios to correctly
> recover the ORTE job. Soon we may/should/will have the same ability with
> certain MPI jobs. :)

Hmm... I'm rather worried about this. I understand your choice, since it is
a good solution for fail/stop/restart behaviour, but looking at it from the
fail/recovery side, can you envision some alternative that reconfigures the
orteds on the fly?

Best regards,
Leonardo
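
For reference, the daemon-failure response quoted above (steps (a), (b) and (c) of the "orcm" module) can be sketched roughly as follows. This is only an illustration in Python, not ORTE code: `Proc`, `DAEMON_JOBID`, `is_daemon` and `recover_from_daemon_failure` are hypothetical names, the reserved daemon jobid of 0 is an assumption made for the sketch, and the round-robin remapping is a placeholder for whatever mapping policy the real module uses.

```python
from dataclasses import dataclass

# Assumption for this sketch: daemons (orteds) live in a reserved job,
# so a failure report can be classified by jobid alone.
DAEMON_JOBID = 0


@dataclass
class Proc:
    jobid: int
    rank: int
    node: str
    state: str = "running"


def is_daemon(proc):
    """Daemon and application failures share one error path; jobid tells them apart."""
    return proc.jobid == DAEMON_JOBID


def recover_from_daemon_failure(daemon, procs, node_up):
    """Apply steps (a)-(c) for a failed daemon and return the remapped procs."""
    daemon.state = "failed"

    # (b) mark the node "down" so it is not reused
    node_up[daemon.node] = False

    # (a) mark every application proc on that node as failed
    victims = [p for p in procs if p.node == daemon.node and not is_daemon(p)]
    for p in victims:
        p.state = "failed"

    # (c) remap and restart on the remaining nodes (round-robin placeholder;
    # starting new daemons on those nodes is not modelled here)
    available = sorted(n for n, up in node_up.items() if up)
    if not available:
        raise RuntimeError("no nodes left to restart on")
    for i, p in enumerate(victims):
        p.node = available[i % len(available)]
        p.state = "restarted"
    return victims
```

In this sketch a node only returns to the pool if something outside the function sets `node_up[node]` back to `True`, which mirrors the point that only the orcm environment detects a rebooted node; under ssh or slurm the entry would simply stay `False`.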