Couldn't have explained it better. Thanks, Jeff, for the summary!
On Tue, 1 Jun 2010, Jeff Squyres wrote:
On May 31, 2010, at 10:27 AM, Ralph Castain wrote:
Just curious - your proposed fix sounds exactly like what was done in
the OPAL SOS work. Are you therefore proposing to use SOS to provide a
more informational status return?
No, I think Sylvain's talking about slightly modifying the existing
mechanism:
1. Return OMPI_SUCCESS: bml then obeys whatever is in the connectivity
bitmask -- even if the bitmask indicates that this BTL can't talk to
anyone.
2. Return != OMPI_SUCCESS: treat the problem as a fatal error.
I think Sylvain's point is that OMPI_SUCCESS can be returned for
non-fatal errors if a BTL just wants to be ignored. In such cases, the
BTL can just set its connectivity mask to 0. This will allow OMPI to
continue the job.
For example, if verbs is borked (e.g., can't create CQ's), it can return
a connectivity mask of 0 and OMPI_SUCCESS. The BML is then free to fail
over to some other BTL.
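
To make the two-way convention above concrete, here is a minimal sketch in C of how a BML-like layer might act on it. It is not the actual R2 code: every identifier (SKETCH_SUCCESS, sketch_classify, sketch_bml_can_continue) is hypothetical, and the connectivity bitmask is simplified to a 64-bit word with one bit per peer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for OMPI_SUCCESS and a generic error code. */
enum { SKETCH_SUCCESS = 0, SKETCH_ERROR = -1 };

typedef enum {
    SKETCH_BTL_USE,     /* success and at least one peer reachable */
    SKETCH_BTL_IGNORE,  /* success with an empty mask: skip this BTL */
    SKETCH_BTL_FATAL    /* anything but success: abort the job */
} sketch_btl_action_t;

/* Classify one BTL from its return code and reachability mask
 * (bit i set means peer i is reachable through this BTL). */
static sketch_btl_action_t sketch_classify(int rc, uint64_t reachable)
{
    if (rc != SKETCH_SUCCESS) {
        return SKETCH_BTL_FATAL;
    }
    return (reachable != 0) ? SKETCH_BTL_USE : SKETCH_BTL_IGNORE;
}

/* Returns true if the job can continue: no BTL reported a fatal error
 * and every peer in all_peers is reachable through some BTL. */
static bool sketch_bml_can_continue(const int *rcs, const uint64_t *masks,
                                    size_t nbtls, uint64_t all_peers)
{
    uint64_t covered = 0;

    assert(rcs != NULL && masks != NULL);
    for (size_t i = 0; i < nbtls; ++i) {
        switch (sketch_classify(rcs[i], masks[i])) {
        case SKETCH_BTL_FATAL:
            return false;
        case SKETCH_BTL_USE:
            covered |= masks[i];
            break;
        case SKETCH_BTL_IGNORE:
            break;
        }
    }
    return (covered & all_peers) == all_peers;
}
```

With two BTLs where the first returns success with an empty mask and the second reaches both peers, sketch_bml_can_continue() returns true, which is the fail-over case. If the only BTL returns an empty mask, it returns false because some peers are unreachable.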
But if a malloc() fails down in some BTL, then the job is hosed anyway
-- so why not return != OMPI_SUCCESS and try to abort cleanly?
For sites that want to treat verbs failures as fatal, we can add a new
MCA param either in the openib BTL that says "treat all init failures as
fatal to the job" or perhaps a new MCA param in R2 that says "if the
connectivity map for BTL <list> is empty, abort the job". Or something
like that.
If so, then it would seem the only real dispute here is: is there -any-
condition whereby a given BTL should have the authority to tell OMPI to
terminate an application, even if other BTLs could still function?
I think his cited example was if malloc() fails.
I could see some sites wanting to abort if their high-speed network was
down (e.g., MX or openib BTLs failed to init) -- they wouldn't want OMPI
to fail over to TCP. The flip side of this argument is that the
sysadmin could set "btl = ^tcp" in the system file, and then if
openib/mx fails, the BML will abort because some peers won't be
reachable.
I understand that the current code may not yet support that operation,
but I do believe that was the intent of the design. So only the case
where -all- BTLs say "I can't do it" would result in termination.
Rather than change that design, I believe the intent is to work towards
completing that implementation. In the interim, it would seem most
sensible to me that we add an MCA param that specifies the termination
behavior (i.e., attempt to continue or terminate on first fatal BTL
error).
Agreed.
I think that there are multiple different exit conditions from a BTL
init:
1. BTL succeeded in initializing, and some peers are reachable
2. BTL succeeded in initializing, and no peers are reachable
3. BTL failed to initialize, but that failure is localized to the BTL
(e.g., openib failed to create a CQ)
4. BTL failed to initialize, and the error is global in nature (e.g.,
malloc() fail)
I think it might be a site-specific decision as to whether to abort the
job for condition 3 or not. Today we default to not failing and pair
that with an indirect method of failing (i.e., setting btl=^tcp).
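
As a rough illustration of the four exit conditions listed above and the site-specific choice for condition 3, the decision could be sketched in C as follows. The enum, the function, and the abort_on_local_failure flag are all hypothetical; the flag stands in for whatever MCA param might eventually be added, and none of this exists in the code base today.

```c
#include <stdbool.h>

/* Hypothetical names for the four BTL init exit conditions. */
typedef enum {
    INIT_OK_SOME_PEERS = 1,  /* initialized, some peers reachable */
    INIT_OK_NO_PEERS   = 2,  /* initialized, no peers reachable */
    INIT_FAILED_LOCAL  = 3,  /* failure confined to this BTL (e.g., no CQ) */
    INIT_FAILED_GLOBAL = 4   /* failure global in nature (e.g., malloc) */
} sketch_init_outcome_t;

/* Decide whether the job should abort because of this one BTL.
 * abort_on_local_failure models the site-specific policy for
 * condition 3 (assumed to come from an MCA param). */
static bool sketch_should_abort(sketch_init_outcome_t outcome,
                                bool abort_on_local_failure)
{
    switch (outcome) {
    case INIT_OK_SOME_PEERS:
    case INIT_OK_NO_PEERS:
        return false;                  /* let the BML use or skip the BTL */
    case INIT_FAILED_LOCAL:
        return abort_on_local_failure; /* site decides */
    case INIT_FAILED_GLOBAL:
        return true;                   /* job is hosed anyway */
    }
    return true; /* unknown outcome: be conservative and abort */
}
```

With the flag set to false, this matches today's default of not failing on condition 3. A site that does not want to fail over to a slower network would set it to true.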
--
Jeff Squyres
jsquy...@cisco.com