On 8 Nov 2011, at 00:59, George Bosilca wrote:

> A started process is defined as being our mpirun. In Open MPI 
> MPIR_partial_attach_ok is defined, so the tool will suppose that we provide a 
> means to synchronize the processes not based on MPIR_debug_gate. Therefore 
> only one behavior is acceptable based on the text above: no MPIR_debug_gate=1 
> should be issued by the tool.

Open MPI itself (via ORTE) is not the only possible launch mechanism for Open 
MPI jobs. Slurm is the only other tool I can think of off the top of my head 
that can do it, but I wouldn't be surprised if there are others.  At the time 
the document was written it was assumed that the MPI library and the resource 
manager/job launcher were so closely integrated that they could be treated as 
part of the same software.

> However, in the ompi_debuggers.c around line 226, we have an if that switches 
> between the two acceptable behaviors (MPIR_debug_gate or own mechanism) based 
> on the fact that we are a standalone (slurmd or generic) or not. As generic 
> is the ess loaded in most of the cases, I can't figure out how this works if 
> the MPIR specification document has to be trusted.

Unless the library can guarantee that the starter process has 
MPIR_partial_attach_ok, the only safe thing it can do is wait on 
MPIR_debug_gate. The only way the library can make any guarantees about mpirun 
is if it's launched from orted.
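
As a rough sketch of that rule (this is not the code in ompi_debuggers.c; the 
helper names and the being_debugged / launched_by_orted flags are made up for 
illustration, and only MPIR_debug_gate comes from the MPIR interface), the 
library-side decision could look something like this:

```c
/* Gate variable from the MPIR interface: a tool sets it to 1 in each
 * MPI process to release it. */
volatile int MPIR_debug_gate = 0;

/* Hypothetical helper. Returns 1 if the library has to fall back to
 * waiting on MPIR_debug_gate, 0 otherwise.
 *   being_debugged    - assumed flag: a tool is controlling the job
 *   launched_by_orted - assumed flag: the process was started by orted,
 *                       so the starter is known to be mpirun */
int must_wait_on_gate(int being_debugged, int launched_by_orted)
{
    if (!being_debugged) {
        return 0;               /* no tool involved, nothing to wait for */
    }
    /* Only an orted launch lets us assume the starter defines
     * MPIR_partial_attach_ok; otherwise the gate is the safe choice. */
    return launched_by_orted ? 0 : 1;
}

/* Hypothetical helper. Polls the gate up to max_polls times (a negative
 * value means poll forever). Returns 1 once the gate is open, 0 if the
 * poll limit was reached first. A real library would sleep between
 * polls; that is left out to keep the sketch in portable C. */
int wait_on_gate(long max_polls)
{
    long polls = 0;

    while (MPIR_debug_gate == 0) {
        if (max_polls >= 0 && polls >= max_polls) {
            return 0;
        }
        polls++;
    }
    return 1;
}
```

A process started by Slurm would then call must_wait_on_gate(1, 0), get 1 
back, and sit in wait_on_gate(-1) until the tool writes MPIR_debug_gate=1.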

I agree that this is not clear. I don't think this spec is well understood by 
anyone; indeed, it wasn't originally written with the intention of becoming a 
specification at all.  I've looked at it a couple of times but never used this 
aspect of it. padb (and I believe stat is the same) never launches jobs 
under control of the debugger; it simply attaches to an already existing job, 
which means I've been able to ignore this part of the spec in padb entirely.

Ashley.
