The question essentially boils down to whether a full fan-out of nonblocking send/recv pairs followed by a wait-all is a valid implementation of MPI_Barrier. Reviewing the text that Dan cited for MPI 4.0:
> (§5.14, p234 MPI-2019-draft): “A correct, portable program must invoke
> collective communications so that deadlock will not occur”

There isn't any convenient way for the user to find out about remote
completion of a barrier (short of building their own barrier with
synchronous sends). So we can either interpret the above statement as
placing a strong completion requirement on collectives (bad for
performance), or interpret it to mean that there is really no safe time at
which a user can call into a blocking external interface. The RMA progress
passage that Martin referenced, with the sockets example given in its
rationale, seems to support the latter interpretation.

~Jim.

On Mon, Oct 12, 2020 at 11:17 AM HOLMES Daniel <d.hol...@epcc.ed.ac.uk> wrote:
> Hi Jim, et al,
>
> Unless the point-to-point pseudo-code given is proven to be a valid
> implementation of MPI_Barrier, reasoning about MPI_Barrier using it as a
> basis is unlikely to be edifying.
> I also have a (possibly flawed) implementation of MPI_Barrier that
> exhibits some odd semantics/behaviours, and I could use that to assert
> (likely incorrectly) that MPI_Barrier is defined in a way that exhibits
> those semantics/behaviours. However, that serves no purpose, so I won't
> dwell on it any further.
>
> I'm glad that someone responded with a reference to the MPI Standard,
> thanks Martin. In that vein, here's my tuppence:
>
> The definition of MPI_Barrier in the MPI Standard states (§5.3, p149 in
> MPI-2019-draft):
> “If comm is an intracommunicator, MPI_BARRIER blocks the caller until all
> group members have called it. The call returns at any process only after
> all group members have entered the call.”
>
> There is a happens-before between “all MPI processes have entered the
> MPI_Barrier call” and “MPI processes are permitted to leave the call”.
> That's it; that's all MPI_Barrier does/is required to ensure.
>
> There is no indication or requirement for alacrity.
> This appears to be a valid (although stupid) implementation:
>
> int MPI_Barrier(MPI_Comm comm) {
>   int ret = PMPI_Barrier(comm);
>   sleep(100 * 24 * 60 * 60); /* 100 days */
>   return ret;
> }
>
> There is no indication or suggestion of how or when MPI processes become
> aware that the necessary pre-condition for returning control to the user
> has been satisfied. Some may become aware of this situation a significant
> amount of time before/after others. Local completion does not guarantee
> remote completion in MPI (except for passive-target RMA, e.g.
> MPI_Win_unlock).
>
> There is no indication or requirement that the necessary pre-condition is
> also a sufficient pre-condition, although we may wish to assume that, and
> we may wish to clarify the wording of the MPI Standard to specify it
> explicitly. If the MPI Standard text were changed to “The call returns at
> any process <strike>only</strike> immediately after all group members have
> entered the call.” then (given the other usage of “immediate” in the MPI
> Standard) we could assume that the procedure becomes strong local
> (immediate) once the necessary pre-condition is met. Without the word
> “immediately” in the sentence, the return of the MPI procedure is
> permitted to require remote progress, i.e. after the necessary
> pre-condition is met, it becomes weak local (called “local” in the MPI
> Standard). Some MPI libraries (can, if configured in a particular way)
> provide strong progress; however, MPI only requires weak progress. Weak
> progress means that remote progress is permitted to happen only during
> remote MPI procedure calls.
>
> So,
>
> If MPI required “returns immediately after...” (which it does not), then
> every MPI process would be required to ensure the remote completion of its
> “send” (as well as local completion of the “recv”) before it returns
> control to the user.
> This would mean that our intuitive feel for what MPI_Barrier should do
> would be correct and the suggested point-to-point code would be an
> incorrect implementation of MPI_Barrier.
> If MPI required strong progress (which it does not), then every MPI
> process would eventually become aware that it is permitted to return
> control to the user, without additional remote MPI procedure calls. This
> would mean that our intuitive feel for what MPI_Barrier should do would be
> correct and the suggested point-to-point code would be a correct
> implementation of MPI_Barrier.
>
> As it is, our intuitive feel for what MPI_Barrier should do is probably
> wrong (i.e. not what MPI actually specifies), or at least too optimistic,
> because it depends on a high-quality implementation that exceeds what the
> MPI Standard minimally specifies as required.
> As it is, the MPI_Barrier in the original question does not guard against
> problems with the non-MPI file operations; indeed, adding it introduces a
> new possibility of deadlock, which would not be present in the code
> without the MPI_Barrier operation.
>
> I would argue that the original code is therefore erroneous
> (incorrect/non-portable) because (§5.14, p234 MPI-2019-draft):
> “A correct, portable program must invoke collective communications so
> that deadlock will not occur”
>
> One correct program that achieves what the original looks like it might
> be trying to achieve (IMHO) is as follows:
>
> if (rank == 1)
>   create_file("test");
> MPI_Barrier();
> if (rank == 0)
>   while (not_exists("test"))
>     sleep(1);
>
> This program still assumes that the file creation actually creates the
> file and flushes it to a filesystem that makes it visible to the
> existence check, but that must be true if the code-without-MPI is
> correct, i.e. adding MPI has not introduced a new problem to the code.
>
> Taking this reasoning about the minimal requirements of MPI_Barrier (at
> least) one step too far, the only restriction on the implementation of
> MPI_Barrier seems to be “do not return until <something happens>”, which
> suggests this is a valid (although very unhelpful) implementation:
>
> int MPI_Barrier(MPI_Comm comm) {
>   while (1); // do not return, ever
> }
>
> To guard against low-quality/malicious implementations of the MPI
> Standard, we could either clarify the wording of the text about
> MPI_Barrier (and probably the text about every other MPI procedure) to
> include the concept of becoming an “immediate” procedure once certain
> criteria are met (likely to be a lot of effort/angst for some), or
> mandate strong progress for all MPI libraries (likely to be very
> unpopular for some).
>
> Cheers,
> Dan.
> —
> Dr Daniel Holmes PhD
> Architect (HPC Research)
> d.hol...@epcc.ed.ac.uk
> Phone: +44 (0) 131 651 3465
> Mobile: +44 (0) 7940 524 088
> Address: Room 2.09, Bayes Centre, 47 Potterrow, Central Area, Edinburgh, EH8 9BT
> —
> The University of Edinburgh is a charitable body, registered in Scotland,
> with registration number SC005336.
> —
>
> On 12 Oct 2020, at 10:04, Martin Schulz via mpi-forum <
> mpi-forum@lists.mpi-forum.org> wrote:
>
> Hi Jim, all,
>
> We had a similar discussion (in a smaller circle) during the terms
> discussions. At least to my understanding, all bets are off as soon as
> you add dependencies and wait conditions outside of MPI, like here with
> the file.
> A note to this point is in a rationale (Section 11.7, page 491 in the
> 2019 draft). Based on that, an MPI implementation is allowed to deadlock
> (or cause a deadlock). If all dependencies were in MPI calls, then
> “eventual” progress should be guaranteed, even if it comes after the 100
> days in Rajeev's example: that would, as far as I understand, still be
> correct behavior, as no MPI call is guaranteed to return in a fixed
> finite time (all calls are at best “weak local”).
>
> Martin
>
> --
> Prof. Dr. Martin Schulz, Chair of Computer Architecture and Parallel Systems
> Department of Informatics, TU-Munich, Boltzmannstraße 3, D-85748 Garching
> Member of the Board of Directors at the Leibniz Supercomputing Centre (LRZ)
> Email: schu...@in.tum.de
>
> *From: *mpi-forum <mpi-forum-boun...@lists.mpi-forum.org> on behalf of
> Jim Dinan via mpi-forum <mpi-forum@lists.mpi-forum.org>
> *Reply-To: *Main MPI Forum mailing list <mpi-forum@lists.mpi-forum.org>
> *Date: *Sunday, 11. October 2020 at 23:41
> *To: *"Skjellum, Anthony" <tony-skjel...@utc.edu>
> *Cc: *Jim Dinan <james.di...@gmail.com>, Main MPI Forum mailing list <
> mpi-forum@lists.mpi-forum.org>
> *Subject: *Re: [Mpi-forum] [EXT]: Progress Question
>
> You can have a situation where the isend/irecv pair completes at process
> 0 before process 1 has called irecv or waitall. Since process 0 is now
> busy-waiting on the file, it will not make progress on MPI calls, which
> can result in deadlock.
>
> ~Jim.
>
> On Sat, Oct 10, 2020 at 2:17 PM Skjellum, Anthony <tony-skjel...@utc.edu>
> wrote:
>
> Jim, OK, my attempt at answering below.
>
> See if you agree with my annotations.
>
> -Tony
>
> Anthony Skjellum, PhD
> Professor of Computer Science and Chair of Excellence
> Director, SimCenter
> University of Tennessee at Chattanooga (UTC)
> tony-skjel...@utc.edu [or skjel...@gmail.com]
> cell: 205-807-4968
>
> ------------------------------
> *From:* mpi-forum <mpi-forum-boun...@lists.mpi-forum.org> on behalf of
> Jim Dinan via mpi-forum <mpi-forum@lists.mpi-forum.org>
> *Sent:* Saturday, October 10, 2020 1:31 PM
> *To:* Main MPI Forum mailing list <mpi-forum@lists.mpi-forum.org>
> *Cc:* Jim Dinan <james.di...@gmail.com>
> *Subject:* [EXT]: [Mpi-forum] Progress Question
>
> *External Email*
> Hi All,
>
> A colleague recently asked a question that I wasn't able to answer
> definitively. Is the following code guaranteed to make progress?
>
> MPI_Barrier();
> -- everything is uncertain to within one message, if layered on pt2pt;
> --- let's assume a power of 2, and recursive doubling (RD).
> --- At each stage, it posts an irecv and isend to its corresponding
> element in RD.
> --- All stages must complete to get to the last stage.
> --- At the last stage, it appears like your example below for N/2
> independent process pairs, which appears always to complete.
> if rank == 1
>   create_file("test")
> if rank == 0
>   while not_exists("test")
>     sleep(1);
>
> That is, can rank 1 require rank 0 to make MPI calls after its return
> from the barrier, in order for rank 1 to complete the barrier? If the
> code were written as follows:
>
> isend(..., other_rank, &req[0])
> irecv(..., other_rank, &req[1])
> waitall(2, req)
> --- Assume both isends buffer on the send side and return immediately
> (which is valid).
> --- Both irecvs are posted, but unmatched as yet. Nothing has
> transferred on the network.
> --- Waitall would mark the isends done at once, and work to complete the
> irecvs; in that process, each would have to progress the isends across
> the network. On this comm and all comms, incidentally.
> --- When waitall returns, the data has transferred to the receiver;
> otherwise the irecvs aren't done.
> if rank == 1
>   create_file("test")
> if rank == 0
>   while not_exists("test")
>     sleep(1);
>
> I think it would clearly not guarantee progress, since the send data can
> be buffered. Is the same true for barrier?
>
> Cheers,
> ~Jim.
>
> *This message is not from a UTC.EDU <http://utc.edu/> address. Caution
> should be used in clicking links and downloading attachments from
> unknown senders or unexpected email.*
>
> _______________________________________________
> mpi-forum mailing list
> mpi-forum@lists.mpi-forum.org
> https://lists.mpi-forum.org/mailman/listinfo/mpi-forum
_______________________________________________
mpi-forum mailing list
mpi-forum@lists.mpi-forum.org
https://lists.mpi-forum.org/mailman/listinfo/mpi-forum