On May 28, 2010, at 6:04 AM, Sylvain Jeaugey wrote:

> Having errors on add_procs stop the application seems a good thing in all
> cases, so why not do it? That would solve my original problem which led
> to this discussion.
>
> Yes, the openib BTL may be suboptimal (stopping the application instead of
> nicely returning), but I'm fine with that, so I'm not very inclined to
> spend time on this.

Herein lies the quandary: we don't/can't know the user or sysadmin intent. They may not care if the IB is borked -- they might just want the job to fall over to TCP and continue. But they may care a lot if IB is borked -- they might want the job to abort (because it would be too slow over TCP).

So I don't think it's a good idea to always abort if a single BTL is busted. The typical Open MPI Way is to introduce an MCA parameter that lets the user / sysadmin choose which behavior they want.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
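
To make the MCA-parameter idea above concrete, here is a rough, standalone sketch of the decision logic. It is not Open MPI code: the policy values ("fallback" / "abort"), the type names, and the function names are all hypothetical, and a real implementation would register the parameter through the MCA framework rather than parse a string by hand.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical policy values; NOT an existing Open MPI parameter. */
typedef enum {
    BTL_FAIL_FALLBACK, /* drop the broken BTL, continue on the others (e.g. TCP) */
    BTL_FAIL_ABORT     /* treat any broken BTL as fatal */
} btl_fail_policy_t;

typedef enum { ACTION_CONTINUE, ACTION_ABORT } action_t;

/* Parse the user/sysadmin-supplied value. NULL or an unrecognized
 * string falls back to the "fallback" default. */
btl_fail_policy_t parse_policy(const char *value)
{
    if (value != NULL && strcmp(value, "abort") == 0) {
        return BTL_FAIL_ABORT;
    }
    return BTL_FAIL_FALLBACK;
}

/* Decide what to do after add_procs has been attempted on every BTL.
 *   failed: number of BTLs whose add_procs reported an error
 *   usable: number of BTLs that succeeded */
action_t on_add_procs_done(btl_fail_policy_t policy, int failed, int usable)
{
    if (usable == 0) {
        return ACTION_ABORT;    /* nothing left to carry traffic */
    }
    if (failed > 0 && policy == BTL_FAIL_ABORT) {
        return ACTION_ABORT;    /* user asked for strict behavior */
    }
    return ACTION_CONTINUE;     /* fall over to the remaining BTLs */
}
```

With this shape, a site that considers TCP too slow would set the (hypothetical) parameter to "abort", while everyone else keeps the fall-over default; the case where no BTL is usable aborts regardless of the setting.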