A clarification -- this specific issue arises during add_procs(), which, for 
jobs that do not use the MPI-2 dynamic process functionality, happens during 
MPI_INIT.  The error detection/abort of option #3 is therefore not during the 
dynamic/lazy MPI peer connection wireup.

The potential for long timeout delays mentioned in #1 would be during the 
dynamic/lazy MPI peer connection wireup.
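
To make the timeout concern concrete: during lazy wireup, a blocking TCP 
connect() is bounded only by the kernel's SYN retry policy, which can take 
minutes to give up.  Here is a minimal sketch -- not the actual TCP BTL code; 
the helper name and timeout policy are illustrative only -- of how a BTL can 
bound that delay itself before failing over to another BTL:

    #include <errno.h>
    #include <fcntl.h>
    #include <poll.h>
    #include <sys/socket.h>

    /* Return 0 if connected within timeout_ms; -1 on error or timeout,
     * at which point the caller would fail over to the next BTL. */
    static int bounded_connect(int fd, const struct sockaddr *sa,
                               socklen_t salen, int timeout_ms)
    {
        int flags = fcntl(fd, F_GETFL, 0);
        fcntl(fd, F_SETFL, flags | O_NONBLOCK);

        if (0 == connect(fd, sa, salen)) {
            return 0;                   /* connected immediately */
        }
        if (EINPROGRESS != errno) {
            return -1;                  /* hard failure: fail over now */
        }

        struct pollfd pfd = { .fd = fd, .events = POLLOUT };
        if (poll(&pfd, 1, timeout_ms) <= 0) {
            return -1;                  /* our timeout, not the kernel's */
        }

        int err = 0;
        socklen_t errlen = sizeof(err);
        getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &errlen);
        return (0 == err) ? 0 : -1;
    }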


On Jun 4, 2010, at 3:07 PM, Ralph Castain wrote:

> I think Rolf's reply makes a possibly bad assumption - i.e., that this 
> problem is occurring just as the job is starting to run. Let me give you a 
> real-life example where this wasn't true, and where aborting the job would 
> make a very unhappy user:
> 
> We start a long-running job (i.e., days) on a very large cluster that has 
> both IB and TCP connections. OMPI correctly places the IB network at highest 
> priority. During the run, the IB connection on a node begins to have trouble. 
> The node itself is fine, and the procs are fine - it is only the comm link 
> that is having problems.
> 
> Yes, we could just abort - but a job this size cannot do checkpoint/restart 
> as the memory consumption is just too large. The app does checkpoint itself 
> (writing critical info to a file) periodically, but the computation is 
> important to complete - rescheduling to get this much of the machine can add 
> weeks of delay.
> 
> So we -really- need to have this job continue running, even at a lower 
> performance, for as long as it possibly can.
> 
> Lest you think this is infrequent - think again. This happens constantly on 
> large IB clusters. IB is flaky, to say the least, and long runs are 
> constantly facing link problems.
> 
> Eventually, we would like the BTL to recover when the IB link is "fixed". 
> Often, this requires the operator to reset the physical connector, or some 
> other operation that can be done while the node is running. So having a 
> mechanism by which jobs can keep running when a BTL connection "fails" for a 
> period of time is a serious requirement.
> 
> 
> On Jun 4, 2010, at 12:47 PM, Rolf vandeVaart wrote:
> 
> > On 06/04/10 11:47, Jeff Squyres wrote:
> >>
> >> On Jun 2, 2010, at 1:36 PM, Jeff Squyres (jsquyres) wrote:
> >>
> >>  
> >>>> We did assume that at least the errors are symmetric, i.e., if A fails 
> >>>> to connect to B, then B will fail when trying to connect to A. However, 
> >>>> if there are other BTLs, the connection is supposed to move smoothly 
> >>>> over to some other BTL. As an example, in the MX BTL, if two nodes have 
> >>>> MX support but do not share the same mapper, add_procs() will silently 
> >>>> fail.
> >>>>      
> >>> This sounds dodgy and icky.  We have to wait for a connect timeout to 
> >>> fail over to the next BTL?  How long is the typical/default TCP timeout?
> >>>    
> >>
> >> After thinking about this more, I still do not think that this is good 
> >> behavior.
> >>
> >> Short version:
> >> --------------
> >>
> >> If a BTL is going to fail, it should do so early in the selection process 
> >> and therefore disqualify itself.  Failing in add_procs() means that it 
> >> lied in the selection process and has created a set of difficult 
> >> implications for the rest of the job.
> >>
> >> Perhaps the best compromise is that there should be an MCA parameter to 
> >> choose between a) the "failover" behavior that George described (i.e., 
> >> wait for the timeout and then have the PML/BML fail over to a 2nd BTL, if 
> >> available), and b) abort the job.
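> >>
> >> Registering such a parameter would be straightforward; a rough sketch, 
> >> assuming the usual mca_base_param_reg_int_name() registration call (the 
> >> parameter name is invented here for illustration, not an existing 
> >> parameter):
> >>
> >>     /* assumes OMPI 1.x's opal/mca/base/mca_base_param.h */
> >>     static int bml_abort_on_addproc_failure = 1;
> >>
> >>     static void register_params(void)
> >>     {
> >>         /* 1 = abort the job (option b); 0 = fail over to the
> >>          * next eligible BTL (option a) */
> >>         mca_base_param_reg_int_name("bml", "abort_on_addproc_failure",
> >>             "If nonzero, abort the job when a BTL fails in "
> >>             "add_procs(); if zero, fail over to the next BTL",
> >>             false, false, 1, &bml_abort_on_addproc_failure);
> >>     }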
> >>
> >> More details:
> >> -------------
> >>
> >> If a BTL has advertised contact information in the modex but then an error 
> >> in add_procs() causes the BTL to not be able to listen at that advertised 
> >> contact point, I think that this is a very serious error.  I see a few 
> >> options:
> >>
> >> 1. Current behavior supposedly has the PML fail over to another eligible 
> >> BTL.  It's not entirely clear whether this functionality works or not, but 
> >> even if it does, it can cause a lengthy "hang" *potentially for each 
> >> connection* while we're waiting for the timeout before failing over to 
> >> another connection.
> >>
> >> --> IMHO: this behavior just invites user questions and bug reports.  It 
> >> is also potentially quite expensive -- consider that in an MPI_ALLTOALL 
> >> operation, every peer might have a substantial delay before it fails over 
> >> to the secondary BTL (e.g., if each failed first contact costs a TCP 
> >> connect timeout on the order of two minutes, even a handful of serialized 
> >> failovers adds many minutes before any data moves).
> >>
> >> 2. When a BTL detects that it cannot honor its advertised contact 
> >> information, either the BTL or the BML can send a modex update to all of 
> >> the process peers, effectively eliminating that contact information.  This 
> >> kind of asynchronous update seems racy and would be difficult to get 
> >> right (indeed, the modex doesn't even currently support an 
> >> after-the-fact update).
> >>
> >> 3. When a BTL detects that it cannot honor its advertised contact 
> >> information, it fails upward to the BML and the BML decides to abort the 
> >> job. 
> >>
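> >> To see the shape of #3 (with #1 as the alternate branch), here is a 
> >> sketch of the BML's add_procs loop; every type and name below is an 
> >> illustrative stand-in, not the real OMPI BML/BTL interface:
> >>
> >>     #include <stdio.h>
> >>     #include <stdlib.h>
> >>
> >>     typedef struct btl {
> >>         const char *name;
> >>         int (*add_procs)(struct btl *self);  /* 0 on success */
> >>         int usable;
> >>     } btl_t;
> >>
> >>     /* the MCA parameter proposed above */
> >>     static int abort_on_addproc_failure = 1;
> >>
> >>     static void bml_add_procs(btl_t *btls, size_t nbtls)
> >>     {
> >>         for (size_t i = 0; i < nbtls; ++i) {
> >>             if (0 != btls[i].add_procs(&btls[i])) {
> >>                 if (abort_on_addproc_failure) {
> >>                     /* option #3: fail upward and abort the job */
> >>                     fprintf(stderr, "BTL %s failed add_procs()\n",
> >>                             btls[i].name);
> >>                     abort();
> >>                 }
> >>                 /* option #1: drop the BTL; the PML fails over to
> >>                  * another one at (lazy) wireup time */
> >>                 btls[i].usable = 0;
> >>             }
> >>         }
> >>     }
> >>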
> >> I think #2 is a bad idea.  I am leaning towards #3 simply because a BTL 
> >> should not fail by the time it reaches add_procs().  If a BTL is going to 
> >> fail, it should do so and disqualify itself earlier in the selection 
> >> process.  Or, perhaps we can have an MCA parameter to switch between #1 
> >> and #3.
> >>
> >> Or maybe someone can think of a #4 that would be better...?
> >>  
> > I think I like idea #3.  It is simple and explainable, and the job aborts 
> > just as it is starting to run.  It seems these cases should be infrequent 
> > and may signify that something is really wrong, so aborting the job is OK.
> > Rolf
> 
> 
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

