We talked about this on the call today.  This proposal was generally accepted.  
I put more details on the following two tickets:

https://svn.open-mpi.org/trac/ompi/ticket/2429
https://svn.open-mpi.org/trac/ompi/ticket/2438
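
For reference, the MCA-parameter compromise from the quoted mail below (choose between the "failover" behavior and aborting the job when add_procs() fails) could be sketched roughly as follows. This is only an illustration, not code from the tree: the parameter values "failover" and "abort", the type names, and both functions are hypothetical, and treating "failover requested but no other BTL left" as an abort is my assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical policy values; NOT an existing Open MPI MCA parameter. */
typedef enum {
    ADD_PROCS_FAILOVER, /* option #1: let the PML/BML fail over to another BTL */
    ADD_PROCS_ABORT     /* option #3: fail upward to the BML and abort the job */
} add_procs_policy_t;

/* What the BML would do after a BTL returns from add_procs(). */
typedef enum {
    ACTION_CONTINUE,     /* the BTL succeeded: keep using it */
    ACTION_USE_NEXT_BTL, /* drop this BTL and try the next eligible one */
    ACTION_ABORT_JOB     /* give up on the whole job */
} bml_action_t;

/* Parse the string value of the hypothetical parameter.
 * Returns 0 on success, -1 if the value is missing or not recognized. */
int parse_policy(const char *value, add_procs_policy_t *out)
{
    assert(NULL != out);
    if (NULL == value) {
        return -1;
    }
    if (0 == strcmp(value, "failover")) {
        *out = ADD_PROCS_FAILOVER;
        return 0;
    }
    if (0 == strcmp(value, "abort")) {
        *out = ADD_PROCS_ABORT;
        return 0;
    }
    return -1;
}

/* Decide what to do with one BTL's add_procs() result.
 * btl_rc:     0 means success, anything else means the BTL failed.
 * other_btls: number of other eligible BTLs that can still reach the peer.
 * Assumption: if failover is requested but no other BTL is left, abort. */
bml_action_t on_add_procs_result(int btl_rc, add_procs_policy_t policy,
                                 size_t other_btls)
{
    if (0 == btl_rc) {
        return ACTION_CONTINUE;
    }
    if (ADD_PROCS_FAILOVER == policy && other_btls > 0) {
        return ACTION_USE_NEXT_BTL;
    }
    return ACTION_ABORT_JOB;
}
```

With this shape, a call such as on_add_procs_result(rc, policy, n) after each BTL's add_procs() would keep the decision in one place, so the choice between #1 and #3 stays a pure policy switch.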



On Jun 4, 2010, at 11:47 AM, Jeff Squyres (jsquyres) wrote:

> On Jun 2, 2010, at 1:36 PM, Jeff Squyres (jsquyres) wrote:
> 
> > > We did assume that at least the errors are symmetric, i.e. if A fails to 
> > > connect to B then B will fail when trying to connect to A. However, if 
> > > there are other BTLs, the connection is supposed to move over smoothly to 
> > > another BTL. As an example in the MX BTL, if two nodes have MX support but 
> > > do not share the same mapper, add_procs will silently fail.
> >
> > This sounds dodgy and icky.  We have to wait for a connect timeout to fail 
> > over to the next BTL?  How long is the typical/default TCP timeout?
> 
> After thinking about this more, I still do not think that this is good 
> behavior.
> 
> Short version:
> --------------
> 
> If a BTL is going to fail, it should do so early in the selection process and 
> therefore disqualify itself.  Failing in add_procs() means that it lied in 
> the selection process and has created a set of difficult implications for the 
> rest of the job.
> 
> Perhaps the best compromise is that there should be an MCA parameter to 
> choose between a) the "failover" behavior that George described (i.e., wait 
> for the timeout and then have the PML/BML fail over to a 2nd BTL, if 
> available), and b) abort the job.
> 
> More details:
> -------------
> 
> If a BTL has advertised contact information in the modex but then an error in 
> add_procs() causes the BTL to not be able to listen at that advertised 
> contact point, I think that this is a very serious error.  I see a few 
> options:
> 
> 1. Current behavior supposedly has the PML fail over to another eligible BTL. 
>  It's not entirely clear whether this functionality works or not, but even if 
> it does, it can cause a lengthy "hang" *potentially for each connection* 
> while we're waiting for the timeout before failing over to another connection.
> 
> --> IMHO: this behavior just invites user questions and bug reports.  It also 
> is potentially quite expensive -- consider that in an MPI_ALLTOALL operation, 
> every peer might have a substantial delay before it fails over to the 
> secondary BTL.
> 
> 2. When a BTL detects that it cannot honor its advertised contact 
> information, either the BTL or the BML can send a modex update to all of the 
> process peers, effectively eliminating that contact information.  This kind 
> of asynchronous update seems racy and difficult to get right (indeed, the 
> modex doesn't even currently support an after-the-fact update).
> 
> 3. When a BTL detects that it cannot honor its advertised contact 
> information, it fails upward to the BML and the BML decides to abort the job. 
> 
> I think #2 is a bad idea.  I am leaning towards #3 simply because a BTL 
> should not fail by the time it reaches add_procs().  If a BTL is going to 
> fail, it should do so and disqualify itself earlier in the selection process. 
>  Or, perhaps we can have an MCA parameter to switch between #1 and #3.
> 
> Or maybe someone can think of a #4 that would be better...?
> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

