On May 28, 2010, at 6:04 AM, Sylvain Jeaugey wrote:

> Having errors on add_procs stop the application seems a good thing in all
> cases, so why not do it? That would solve my original problem which led
> to this discussion.
> 
> Yes, the openib BTL may be suboptimal (stopping the application instead of
> nicely returning), but I'm fine with that, so I'm not very inclined to
> spend time on this.

Herein lies the quandary: we don't/can't know the user's or sysadmin's intent.
They may not care if the IB is borked -- they might just want the job to fall
over to TCP and continue.  But they may care a lot if IB is borked -- they
might want the job to abort (because it would be too slow over TCP).

So I don't think it's a good idea to always abort if a single BTL is busted.  
The typical Open MPI Way is to introduce an MCA parameter that lets the user / 
sysadmin choose which behavior they want.
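
To make that concrete, here is a rough sketch of the decision such a parameter
would drive.  Everything in it is an assumption for illustration: the parameter
name btl_base_add_procs_failure, its "abort" / "fallback" values, and the
helper functions are all hypothetical, and falling back by default is an
arbitrary choice.  It reads the value straight from the environment (MCA
parameters can be set through OMPI_MCA_<name> variables) rather than going
through the real MCA parameter registration code.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical policy values; this is not an existing Open MPI parameter. */
typedef enum {
    BTL_FAIL_FALLBACK, /* drop the broken BTL and continue on whatever is left */
    BTL_FAIL_ABORT     /* treat an add_procs failure as fatal for the job */
} btl_fail_policy_t;

/* Map the parameter's string value to a policy.
 * An unset or unrecognized value falls back (assumed default). */
btl_fail_policy_t parse_btl_fail_policy(const char *value)
{
    if (value != NULL && strcmp(value, "abort") == 0) {
        return BTL_FAIL_ABORT;
    }
    return BTL_FAIL_FALLBACK;
}

/* Read the hypothetical parameter from its OMPI_MCA_-prefixed
 * environment variable. */
btl_fail_policy_t read_btl_fail_policy(void)
{
    return parse_btl_fail_policy(getenv("OMPI_MCA_btl_base_add_procs_failure"));
}

/* Return 1 if the job should abort, 0 if it should keep going.
 * Under the fallback policy the job aborts only when no other BTL
 * can reach the peers. */
int should_abort_job(btl_fail_policy_t policy,
                     int add_procs_failed,
                     int other_btls_usable)
{
    if (!add_procs_failed) {
        return 0;
    }
    if (policy == BTL_FAIL_ABORT) {
        return 1;
    }
    return !other_btls_usable;
}
```

With something like this, a sysadmin who wants IB-or-nothing would set the
value to "abort", and everyone else would get the fall-over-to-TCP behavior.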

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

