Hi Ralph and Josh,

>>> Regarding the schema shown in the picture, I didn't understand RecoS's 
>>> behaviour in a node failure situation.
>>> 
>>> In this case, will mpirun treat the daemon failure as a normal proc 
>>> failure? If so, should mpirun update the global proc state for 
>>> all jobs running under the failed daemon?
>> 
>> I haven't included the node failure case yet - still on my "to-do" list. In 
>> brief, the answer is yes/no. :-)
>> 
>> Daemon failure follows the same code path as shown in the flow chart. 
>> However, it is up to the individual modules to determine a response to that 
>> failure. The "orcm" RecoS module response is to (a) mark all procs on that 
>> node as having failed, (b) mark that node as "down" so it won't get reused, 
>> and (c) remap and restart all such procs on the remaining available nodes, 
>> starting new daemon(s) as required.
>> 
>> In the orcm environment, nodes that are replaced or rebooted automatically 
>> start their own daemon. This is detected by orcm, and the node state (if the 
>> node is rebooted) will automatically be updated to "up" - if it is a new 
>> node, it is automatically added to the available resources. This allows the 
>> node to be reused once the problem has been corrected. In other environments 
>> (ssh, slurm, etc), the node is simply left as "down" as there is no way to 
>> know if/when the node becomes available again.
>> 
>> If you aren't using the "orcm" module, then the default behavior will abort 
>> the job.
> 
> Just to echo this response. The orted and process failures use the same error 
> path, but can be easily differentiated by their jobids. The 'orcm' component 
> is a good example of differentiating these two fault scenarios to correctly 
> recover the ORTE job. Soon we may/should/will have the same ability with 
> certain MPI jobs. :)

Hmm... I'm really worried about this. I understand your choice, since it is a 
really good solution for fail/stop/restart behaviour, but looking at it from 
the fail/recovery side, can you envision some alternative for reconfiguring 
the orteds on the fly?
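
For reference, the behaviour described in the quoted replies can be sketched as 
a toy model. This is plain Python, not ORTE code: the names Proc, Cluster, 
classify_failure and recover_from_daemon_failure are hypothetical, the reserved 
daemon jobid of 0 is an assumption made for illustration, and the round-robin 
remapping is only one possible policy. It shows the two points above: a failure 
is classified by its jobid, and a daemon failure triggers steps (a), (b) and (c).

```python
from dataclasses import dataclass, field

# Assumption for this sketch: all daemons (orteds) share one reserved jobid.
DAEMON_JOBID = 0


@dataclass
class Proc:
    jobid: int
    rank: int
    node: str
    state: str = "running"


@dataclass
class Cluster:
    nodes: dict                                # node name -> "up" or "down"
    procs: list = field(default_factory=list)  # application procs only

    def up_nodes(self):
        return sorted(n for n, s in self.nodes.items() if s == "up")


def classify_failure(proc):
    """Same error path for both cases; the jobid tells them apart."""
    return "daemon" if proc.jobid == DAEMON_JOBID else "application"


def recover_from_daemon_failure(cluster, node):
    """Toy version of the orcm-style response to a lost daemon."""
    # (a) mark all procs on that node as having failed
    victims = [p for p in cluster.procs if p.node == node]
    for p in victims:
        p.state = "failed"

    # (b) mark the node as "down" so it won't get reused
    cluster.nodes[node] = "down"

    # (c) remap and restart the procs on the remaining available nodes
    targets = cluster.up_nodes()
    if not targets:
        # Nothing left to run on: fall back to aborting the job.
        raise RuntimeError("no nodes available, job must abort")
    for i, p in enumerate(victims):
        p.node = targets[i % len(targets)]  # round-robin: a sketch policy only
        p.state = "restarted"
    return victims
```

A caller would first run classify_failure on the failed proc and only invoke 
recover_from_daemon_failure when the result is "daemon". Starting new daemons 
on the target nodes, and bringing a repaired node back to "up", are left out 
of the sketch.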

Best regards,
Leonardo
