Good point. I'll only add one more comment/question before crawling back under my rock.
If you have a failover capability, why would you turn it "off" during a particular phase of the program? Imagine you are a user and your job has sat in the queue for an entire day. Just as it finally starts to run, a glitch hits the IB connection on one of the nodes you were assigned - the link was up during the initial setup, but then drops during the modex. Instead of riding thru and letting your job run, OMPI aborts because the glitch happened at the beginning of the job instead of a split-second later. So now my job doesn't run, and I'm back to waiting in the queue again - even though the job -could- have run, and OMPI had the ability to let it do so.

Note that a similar issue in the OOB would -not- cause the job to fail. Seems weird to me...but it is in the MPI layer, so feel free to do whatever you like! :-)

On Jun 4, 2010, at 6:09 PM, Jeff Squyres wrote:

> A clarification -- this specific issue is during add_procs(), which, for jobs that do not use the MPI-2 dynamics, is during MPI_INIT. The #3 error detection/abort is not during the dynamic/lazy MPI peer connection wireup.
>
> The potential for long timeout delays mentioned in #1 would be during the dynamic/lazy MPI peer connection wireup.
>
>
> On Jun 4, 2010, at 3:07 PM, Ralph Castain wrote:
>
>> I think Rolf's reply makes a possibly bad assumption - i.e., that this problem is occurring just as the job is starting to run. Let me give you a real-life example where this wasn't true, and where aborting the job would make a very unhappy user:
>>
>> We start a long-running job (i.e., days) on a very large cluster that has both IB and TCP connections. OMPI correctly places the IB network at highest priority. During the run, the IB connection on a node begins to have trouble. The node itself is fine, and the procs are fine - it is only the comm link that is having problems.
>>
>> Yes, we could just abort - but a job this size cannot do checkpoint/restart as the memory consumption is just too large. The app does checkpoint itself (writing critical info to a file) periodically, but the computation is important to complete - rescheduling to get this much of the machine can add weeks of delay.
>>
>> So we -really- need to have this job continue running, even at a lower performance, for as long as it possibly can.
>>
>> Lest you think this is infrequent - think again. This happens constantly on large IB clusters. IB is flaky, to say the least, and long runs are constantly facing link problems.
>>
>> Eventually, we would like the BTL to recover when the IB link is "fixed". Often, this requires the operator to reset the physical connector, or some other operation that can be done while the node is running. So having a mechanism by which jobs can keep running when a BTL connection "fails" for a period of time is a serious requirement.
>>
>>
>> On Jun 4, 2010, at 12:47 PM, Rolf vandeVaart wrote:
>>
>>> On 06/04/10 11:47, Jeff Squyres wrote:
>>>>
>>>> On Jun 2, 2010, at 1:36 PM, Jeff Squyres (jsquyres) wrote:
>>>>
>>>>
>>>>>> We did assume that at least the errors are symmetric, i.e. if A fails to connect to B then B will fail when trying to connect to A. However, if there are other BTLs, the connection is supposed to smoothly move over to some other BTL. As an example, in the MX BTL, if two nodes have MX support but do not share the same mapper, the add_procs will silently fail.
>>>>>>
>>>>> This sounds dodgy and icky. We have to wait for a connect timeout to fail over to the next BTL? How long is the typical/default TCP timeout?
>>>>>
>>>>
>>>> After thinking about this more, I still do not think that this is good behavior.
>>>>
>>>> Short version:
>>>> --------------
>>>>
>>>> If a BTL is going to fail, it should do so early in the selection process and therefore disqualify itself. Failing in add_procs() means that it lied in the selection process and has created a set of difficult implications for the rest of the job.
>>>>
>>>> Perhaps the best compromise is that there should be an MCA parameter to choose between a) the "failover" behavior that George described (i.e., wait for the timeout and then have the PML/BML fail over to a 2nd BTL, if available), and b) abort the job.
>>>>
>>>> More details:
>>>> -------------
>>>>
>>>> If a BTL has advertised contact information in the modex but then an error in add_procs() causes the BTL to not be able to listen at that advertised contact point, I think that this is a very serious error. I see a few options:
>>>>
>>>> 1. Current behavior supposedly has the PML fail over to another eligible BTL. It's not entirely clear whether this functionality works or not, but even if it does, it can cause a lengthy "hang" *potentially for each connection* while we're waiting for the timeout before failing over to another connection.
>>>>
>>>> --> IMHO: this behavior just invites user questions and bug reports. It also is potentially quite expensive -- consider that in an MPI_ALLTOALL operation, every peer might have a substantial delay before it fails over to the secondary BTL.
>>>>
>>>> 2. When a BTL detects that it cannot honor its advertised contact information, either the BTL or the BML can send a modex update to all of the process peers, effectively eliminating that contact information. This kind of asynchronous update seems racy and could be difficult to get right (indeed, the modex doesn't even currently support an after-the-fact update).
>>>>
>>>> 3. When a BTL detects that it cannot honor its advertised contact information, it fails upward to the BML and the BML decides to abort the job.
>>>>
>>>> I think #2 is a bad idea. I am leaning towards #3 simply because a BTL should not fail by the time it reaches add_procs(). If a BTL is going to fail, it should do so and disqualify itself earlier in the selection process. Or, perhaps we can have an MCA parameter to switch between #1 and #3.
>>>>
>>>> Or maybe someone can think of a #4 that would be better...?
>>>>
>>> I think I like idea #3. It is simple, explainable, and the job aborts just as it is starting to run. It seems these cases should be infrequent and may signify that something is really wrong, so aborting the job is OK.
>>>
>>> Rolf
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
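
To make the MCA-parameter compromise above concrete, a minimal standalone sketch of the decision logic might look like the following. The parameter name "btl_on_add_procs_error", the environment-variable stand-in for the MCA parameter system, and the toy BTL structures are all hypothetical - this is not existing OMPI code, just an illustration of choosing between failover (#1) and abort (#3) when add_procs() fails.

/*
 * Standalone sketch: a policy switch between (a) falling over to the
 * next eligible BTL and (b) aborting the job when a BTL cannot honor
 * its advertised modex contact information in add_procs().
 *
 * Everything here is hypothetical: the parameter name
 * "btl_on_add_procs_error", the env-var stand-in for the MCA parameter
 * system, and the toy btl_module_t are illustrations only.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef enum {
    ON_ERROR_FAILOVER = 0,   /* behavior #1: try the next eligible BTL */
    ON_ERROR_ABORT    = 1    /* behavior #3: fail upward, abort the job */
} on_error_policy_t;

typedef struct {
    const char *name;
    /* returns 0 on success, nonzero if the BTL cannot honor its
     * advertised contact information for this peer */
    int (*add_procs)(const char *peer);
} btl_module_t;

/* Stand-in for reading the proposed MCA parameter. */
static on_error_policy_t get_policy(void)
{
    const char *val = getenv("BTL_ON_ADD_PROCS_ERROR");   /* hypothetical */
    if (val != NULL && 0 == strcmp(val, "abort")) {
        return ON_ERROR_ABORT;
    }
    return ON_ERROR_FAILOVER;
}

/* Toy BTLs: "ib" fails in add_procs, "tcp" succeeds. */
static int ib_add_procs(const char *peer)  { (void)peer; return -1; }
static int tcp_add_procs(const char *peer) { (void)peer; return 0;  }

int main(void)
{
    btl_module_t btls[] = {
        { "ib",  ib_add_procs  },   /* highest priority, but broken */
        { "tcp", tcp_add_procs },   /* lower-priority fallback */
    };
    const size_t nbtls = sizeof(btls) / sizeof(btls[0]);
    const on_error_policy_t policy = get_policy();

    for (size_t i = 0; i < nbtls; ++i) {
        if (0 == btls[i].add_procs("peer0")) {
            printf("wired up peer0 via %s\n", btls[i].name);
            return EXIT_SUCCESS;
        }
        fprintf(stderr, "%s: add_procs failed for peer0\n", btls[i].name);
        if (ON_ERROR_ABORT == policy) {
            /* behavior #3: don't keep running on a lie in the modex */
            fprintf(stderr, "aborting job (policy=abort)\n");
            return EXIT_FAILURE;
        }
        /* behavior #1: fall through and try the next eligible BTL */
    }
    fprintf(stderr, "no BTL could reach peer0\n");
    return EXIT_FAILURE;
}

In a real implementation the policy check would live in the BML/PML add_procs() path, and the parameter would be registered through the normal MCA parameter machinery rather than read from the environment, so it could be selected at launch time with mpirun's --mca option (again, the parameter name here is only a placeholder).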