A clarification -- this specific issue occurs during add_procs(), which, for jobs that do not use the MPI-2 dynamic process functions, happens during MPI_INIT. The #3 error detection/abort is therefore not during the dynamic/lazy MPI peer connection wireup.

The potential for long timeout delays mentioned in #1 would be during the dynamic/lazy MPI peer connection wireup.
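To make the two failure windows concrete, here is a minimal sketch -- not the real ompi code; all names are illustrative -- of where the proposed #3 abort would fire, assuming a hypothetical btl_t module type:

    /* Window 1: add_procs() runs inside MPI_INIT for jobs that do not
     * use the MPI-2 dynamics; this is where #3 would detect and abort.
     * Window 2: the lazy peer connection wireup happens later, on the
     * first send to a peer, and is where the #1 timeout delays would
     * be felt. */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct btl {
        const char *name;
        int (*add_procs)(struct btl *self, int nprocs);  /* 0 == success */
    } btl_t;

    /* Called from MPI_INIT: every selected BTL must set up the endpoints
     * it advertised in the modex.  Under proposal #3, failure is fatal. */
    static void mpi_init_add_procs(btl_t **btls, int nbtls, int nprocs)
    {
        for (int i = 0; i < nbtls; ++i) {
            if (btls[i]->add_procs(btls[i], nprocs) != 0) {
                fprintf(stderr, "%s: add_procs() failed during MPI_INIT; "
                        "aborting (proposal #3)\n", btls[i]->name);
                exit(1);
            }
        }
    }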
On Jun 4, 2010, at 3:07 PM, Ralph Castain wrote:

> I think Rolf's reply makes a possibly bad assumption - i.e., that this
> problem only occurs just as the job is starting to run. Let me give you a
> real-life example where this wasn't true, and where aborting the job would
> make a very unhappy user:
>
> We start a long-running job (i.e., days) on a very large cluster that has
> both IB and TCP connections. OMPI correctly places the IB network at highest
> priority. During the run, the IB connection on a node begins to have trouble.
> The node itself is fine, and the procs are fine - it is only the comm link
> that is having problems.
>
> Yes, we could just abort - but a job this size cannot do checkpoint/restart,
> as the memory consumption is just too large. The app does checkpoint itself
> (writing critical info to a file) periodically, but the computation is
> important to complete - rescheduling to get this much of the machine can add
> weeks of delay.
>
> So we -really- need this job to keep running, even at lower performance,
> for as long as it possibly can.
>
> Lest you think this is infrequent - think again. This happens constantly on
> large IB clusters. IB is flaky, to say the least, and long runs are
> constantly facing link problems.
>
> Eventually, we would like the BTL to recover when the IB link is "fixed".
> Often, this requires the operator to reset the physical connector, or some
> other operation that can be done while the node is running. So having a
> mechanism by which jobs can keep running when a BTL connection "fails" for a
> period of time is a serious requirement.
>
>
> On Jun 4, 2010, at 12:47 PM, Rolf vandeVaart wrote:
>
> > On 06/04/10 11:47, Jeff Squyres wrote:
> >>
> >> On Jun 2, 2010, at 1:36 PM, Jeff Squyres (jsquyres) wrote:
> >>
> >>>> We did assume that at least the errors are symmetric, i.e., if A fails
> >>>> to connect to B, then B will fail when trying to connect to A. However,
> >>>> if there are other BTLs, the connection is supposed to move smoothly
> >>>> over to some other BTL. As an example, in the MX BTL, if two nodes have
> >>>> MX support but do not share the same mapper, add_procs will silently
> >>>> fail.
> >>>>
> >>> This sounds dodgy and icky. We have to wait for a connect timeout to
> >>> fail over to the next BTL? How long is the typical/default TCP timeout?
> >>>
> >> After thinking about this more, I still do not think that this is good
> >> behavior.
> >>
> >> Short version:
> >> --------------
> >>
> >> If a BTL is going to fail, it should do so early in the selection process
> >> and thereby disqualify itself. Failing in add_procs() means that it lied
> >> during the selection process and has created a set of difficult
> >> implications for the rest of the job.
> >>
> >> Perhaps the best compromise is that there should be an MCA parameter to
> >> choose between a) the "failover" behavior that George described (i.e.,
> >> wait for the timeout and then have the PML/BML fail over to a 2nd BTL, if
> >> available), and b) aborting the job.
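As an aside on that compromise: a minimal sketch of how such a switch might be registered, assuming the 1.x-series mca_base_param_reg_int() call; the parameter name "abort_on_add_procs_error" is invented here for illustration:

    #include "opal/mca/base/mca_base_param.h"

    static int mca_btl_base_abort_on_add_procs_error = 0;

    static void register_failure_policy(const mca_base_component_t *component)
    {
        /* 0 = (a) fail over to the next eligible BTL after the timeout
         * 1 = (b) abort the job as soon as add_procs() fails */
        mca_base_param_reg_int(component,
                               "abort_on_add_procs_error", /* hypothetical */
                               "If nonzero, abort the job when a BTL fails in "
                               "add_procs(); otherwise fail over to the next "
                               "eligible BTL",
                               false, false, 0,
                               &mca_btl_base_abort_on_add_procs_error);
    }

A user could then pick the behavior per run, e.g. "mpirun --mca btl_base_abort_on_add_procs_error 1 ..." (again, a hypothetical parameter name).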
> >> More details:
> >> -------------
> >>
> >> If a BTL has advertised contact information in the modex, but then an
> >> error in add_procs() leaves the BTL unable to listen at that advertised
> >> contact point, I think that is a very serious error. I see a few options:
> >>
> >> 1. Current behavior supposedly has the PML fail over to another eligible
> >> BTL. It is not entirely clear whether this functionality works, but even
> >> if it does, it can cause a lengthy "hang" *potentially for each
> >> connection* while we wait for the timeout before failing over to another
> >> connection.
> >>
> >> --> IMHO: this behavior just invites user questions and bug reports. It
> >> is also potentially quite expensive -- consider that in an MPI_ALLTOALL
> >> operation, every peer might see a substantial delay before it fails over
> >> to the secondary BTL.
> >>
> >> 2. When a BTL detects that it cannot honor its advertised contact
> >> information, either the BTL or the BML can send a modex update to all of
> >> the process's peers, effectively eliminating that contact information.
> >> This kind of asynchronous update seems racy and hard to get right
> >> (indeed, the modex does not even currently support an after-the-fact
> >> update).
> >>
> >> 3. When a BTL detects that it cannot honor its advertised contact
> >> information, it fails upward to the BML, and the BML decides to abort the
> >> job.
> >>
> >> I think #2 is a bad idea. I am leaning towards #3, simply because a BTL
> >> should not fail by the time it reaches add_procs(). If a BTL is going to
> >> fail, it should do so and disqualify itself earlier in the selection
> >> process. Or perhaps we can have an MCA parameter to switch between #1
> >> and #3.
> >>
> >> Or maybe someone can think of a #4 that would be better...?
> >>
> > I think I like idea #3. It is simple, explainable, and the job aborts just
> > as it is starting to run. These cases should be infrequent and may signify
> > that something is really wrong, so aborting the job is OK.
> >
> > Rolf

--
Jeff Squyres
jsquy...@cisco.com
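For concreteness, a rough sketch of the #1 failover path discussed above -- illustrative names only, not the actual mca_bml_r2 types or code:

    #include <stddef.h>
    #include <stdio.h>

    typedef struct btl {
        const char *name;
        int (*send)(struct btl *self, const void *buf, size_t len); /* 0 == ok */
    } btl_t;

    /* Each peer endpoint keeps its eligible BTLs in priority order,
     * e.g. { openib, tcp }. */
    typedef struct endpoint {
        btl_t **btls;
        int     nbtls;
    } endpoint_t;

    /* A failed send falls through to the next BTL.  Each failed attempt
     * may first sit in a connect timeout; in an MPI_ALLTOALL every peer
     * can pay that delay, which is the cost flagged in #1 above. */
    static int endpoint_send(endpoint_t *ep, const void *buf, size_t len)
    {
        for (int i = 0; i < ep->nbtls; ++i) {
            if (ep->btls[i]->send(ep->btls[i], buf, len) == 0) {
                return 0;                   /* delivered */
            }
            fprintf(stderr, "send via %s failed; failing over\n",
                    ep->btls[i]->name);
        }
        return -1;                          /* all BTLs exhausted */
    }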