Hi Dan,

Ah, very useful to know, thanks! Is there a nonblocking version of MPI_COMM_DISCONNECT? (I’ve searched the web for MPI_COMM_IDISCONNECT and it comes up empty, but that’s not canonical :-)
If not, can a capability like this be added to any “wish lists”? Ideally, calling something like MPI_COMM_IDISCONNECT and then having the request for that operation complete would mean that MPI_COMM_FREE would be guaranteed to be both nonblocking and complete locally. Thoughts?

Quincey

> On Aug 21, 2020, at 10:01 AM, HOLMES Daniel <d.hol...@epcc.ed.ac.uk> wrote:
>
> Hi Quincey,
>
> Calling MPI_COMM_FREE when some requests representing nonblocking or persistent operations are still active is not prohibited by MPI and seems to work successfully in all the MPI libraries I’ve tested.
>
> The normative description for MPI_COMM_FREE in the MPI Standard specifically calls out that it will only mark the communicator for freeing later and may return to the user before pending/ongoing communication is complete. It does not require that the completion procedure has been called for active operations.
>
> We discussed in the Forum (as recently as the meeting this week) that this is a key difference between MPI_COMM_FREE and MPI_COMM_DISCONNECT: the latter states that the user is required to call the completion procedure(s) for all operations using a communicator before disconnecting it using MPI_COMM_DISCONNECT, which will wait for all pending communication to complete internally.
>
> OTOH, I’m not sure that doing this buys you as much as you think it might.
>
> MPI_COMM_FREE is a collective procedure, so it is permitted to wait until MPI_COMM_FREE has been called at all other MPI processes in the communicator, i.e. it can have blocking-barrier-like semantics. All collective operations must be initialised in the same order at all processes in the communicator. So a valid implementation could do all the pending work inside MPI_COMM_FREE, but the Standard also permits an implementation that does nothing other than change a “ready-for-freeing” flag on the local communicator object.
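A minimal sketch of the ordering being discussed, assuming only the semantics described above (error checking omitted; this needs an MPI installation and launcher to actually run): the communicator is freed while a nonblocking barrier on it is still active, which is permitted because MPI_Comm_free only marks the communicator for freeing.

```c
/* Sketch: MPI_Comm_free called while an MPI_Ibarrier request on the
 * same communicator is still active.  Per the discussion above, this
 * is legal: the communicator is only marked for freeing, and the
 * request must still be completed (here, with MPI_Wait). */
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm    dup;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_dup(MPI_COMM_WORLD, &dup);

    MPI_Ibarrier(dup, &req);            /* request is now active ...      */
    MPI_Comm_free(&dup);                /* ... but freeing is allowed:
                                           communicator is only marked    */
    MPI_Wait(&req, MPI_STATUS_IGNORE);  /* completion still happens here  */

    MPI_Finalize();
    return 0;
}
```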
>
>> Am I allowed to call MPI_COMM_FREE while I have an uncompleted request for a nonblocking collective operation (like MPI_IBARRIER) on the communicator?
>
> Yes.
>
>> Will MPI_COMM_FREE block for completion of the NBC op?
>
> No.
>
> Cheers,
> Dan.
> —
> Dr Daniel Holmes PhD
> Architect (HPC Research)
> d.hol...@epcc.ed.ac.uk
> Phone: +44 (0) 131 651 3465
> Mobile: +44 (0) 7940 524 088
> Address: Room 2.09, Bayes Centre, 47 Potterrow, Central Area, Edinburgh, EH8 9BT
> —
> The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
> —
>
>> On 21 Aug 2020, at 15:26, Quincey Koziol via mpi-forum <mpi-forum@lists.mpi-forum.org> wrote:
>>
>> Hi Dan,
>> I agree with you about MPI barriers, but that’s why I said it was a simplified pseudocode. :-) We do have more mechanisms in place for handling the “fence-ness” of the operation, but barriers are a component and I’d like to move to a nonblocking version when possible.
>>
>> Having more, hopefully all, of the file operations get equivalent nonblocking versions would be _very_ nice, and I could simplify our internal code more if a nonblocking MPI_FILE_SYNC were available. A nonblocking version of MPI_FILE_SET_SIZE would also be high on my list.
>>
>> Yes, I grok the behavior of MPI_FILE_CLOSE, but don’t want to add a barrier on top of it. :-)
>>
>> One new question: Am I allowed to call MPI_COMM_FREE while I have an uncompleted request for a nonblocking collective operation (like MPI_IBARRIER) on the communicator? Will MPI_COMM_FREE block for completion of the NBC op?
>>
>> Thanks!
>> Quincey
>>
>>> On Aug 15, 2020, at 6:07 AM, HOLMES Daniel <d.hol...@epcc.ed.ac.uk> wrote:
>>>
>>> Hi Quincey,
>>>
>>> The MPI barrier operation (whether blocking, nonblocking, or persistent) does not guarantee “memory fence” semantics (either for the content of memory or the content of files).
>>>
>>> Perhaps you are looking for MPI_FILE_SYNC?
>>>
>>> "If other processes have made updates to the storage device, then all such updates become visible to subsequent reads of fh by the calling process.” §13.6.1
>>>
>>> "MPI_FILE_SYNC is a collective operation.” §13.6.1
>>>
>>> Used correctly (the user must locally complete their I/O operations before calling it), this does provide a “fence”-like guarantee *for the file*, which is what your code appears to be attempting. That is, all writes to the file that were initiated remotely (and locally completed at the remote process) before the matching remote call to MPI_FILE_SYNC are guaranteed to be visible in the file to MPI read operations issued locally once the local call to MPI_FILE_SYNC completes.
>>>
>>> There is currently no nonblocking or persistent expression of this MPI procedure. Watch this space: it is on the to-do list for MPI-Next.
>>>
>>> As Jim points out, the performance problem you note is most likely due to the implicit MPI_FILE_SYNC-like synchronisation done internally by MPI during the MPI_FILE_CLOSE procedure call. All enqueued file operations targeting the file will be flushed to the file during MPI_FILE_CLOSE. If file operations are not flushed to the file concurrently with the application work or the MPI communication operations, then they will still be enqueued when MPI_FILE_CLOSE is called.
>>>
>>> Cheers,
>>> Dan.
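The “fence for the file” guarantee described above is usually exercised with the sync-barrier-sync idiom from the file consistency semantics in §13.6.1. A minimal sketch (error checking omitted; assumes an already-opened shared file handle):

```c
/* Sketch of the sync-barrier-sync idiom: rank 0's write becomes
 * visible to rank 1's read through the file.  Both ranks call the
 * collective MPI_File_sync; the barrier orders the writer's sync
 * before the reader's second sync. */
#include <mpi.h>

void exchange_through_file(MPI_File fh, MPI_Comm comm, int rank,
                           int value, int *seen)
{
    MPI_Status st;

    if (rank == 0)  /* writer: locally complete the write first */
        MPI_File_write_at(fh, 0, &value, 1, MPI_INT, &st);

    MPI_File_sync(fh);    /* collective: flush updates to storage     */
    MPI_Barrier(comm);    /* order remote sync before the local one   */
    MPI_File_sync(fh);    /* make remote updates visible locally      */

    if (rank == 1)  /* reader: now guaranteed to see rank 0's write */
        MPI_File_read_at(fh, 0, seen, 1, MPI_INT, &st);
}
```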
>>>
>>>> On 14 Aug 2020, at 17:32, Quincey Koziol via mpi-forum <mpi-forum@lists.mpi-forum.org> wrote:
>>>>
>>>> Hi Dan,
>>>> I believe that Pavan was referring to my conversation with him about MPI_Request_free. Here’s my situation: I’d like to use MPI_Ibarrier as a form of “memory fence” between some of the metadata reads and writes in HDF5. Here’s some [very] simplified pseudocode for what I’d like to do:
>>>>
>>>> ===============================
>>>>
>>>> <open HDF5 file>   // sets up a communicator for internal HDF5 communication about this file
>>>>
>>>> do {
>>>>     MPI_Ibarrier(<file’s communicator>, &request);
>>>>
>>>>     <application stuff>
>>>>
>>>>     // HDF5 operation:
>>>>     if (<operation is read or write>) {
>>>>         MPI_Wait(&request, MPI_STATUS_IGNORE);
>>>>         <perform read / write>
>>>>     }
>>>>     else {   // operation is a file close
>>>>         MPI_Request_free(&request);
>>>>         MPI_File_close(…);
>>>>         MPI_Comm_free(<file’s communicator>);
>>>>     }
>>>> } while (<file is open>);
>>>>
>>>> ===============================
>>>>
>>>> What I am really trying to avoid is calling MPI_Wait at file close, since it is semantically unnecessary and only increases the latency from the application’s perspective.
>>>> If I can’t call MPI_Request_free on a nonblocking collective operation’s request (and it looks like I can’t, right now), I will have to put the request and the file’s communicator into a “cleanup” list that is polled periodically [on each rank] with MPI_Test and disposed of when the nonblocking barrier completes locally.
>>>>
>>>> So, I’d really like to be able to call MPI_Request_free on the nonblocking barrier’s request.
>>>>
>>>> Thoughts?
>>>>
>>>> Quincey
>>>>
>>>>> On Aug 13, 2020, at 9:07 AM, HOLMES Daniel via mpi-forum <mpi-forum@lists.mpi-forum.org> wrote:
>>>>>
>>>>> Hi Jim,
>>>>>
>>>>> To be clear, I think that MPI_CANCEL is evil and should be removed from the MPI Standard entirely at the earliest convenience.
>>>>>
>>>>> I am certainly not arguing that it be permitted for more MPI operations.
>>>>>
>>>>> I thought the discussion was focused on MPI_REQUEST_FREE and whether or not it can/should be used on an active request.
>>>>>
>>>>> If a particular MPI implementation does not keep a reference to the request between MPI_RPUT and MPI_REQUEST_FREE, but needs that reference to process the completion event, then that MPI implementation would be required to keep a reference to the request from MPI_REQUEST_FREE until that important task had been done, perhaps until the close-epoch call. This requires no new memory, because the user is giving up their reference to the request, so MPI can safely use the request it is passed in MPI_REQUEST_FREE without copying it. As you say, MPI takes over the responsibility for processing the completion event.
>>>>>
>>>>> Your question about why the implementation should be required to take on this complexity is a good one. That, I guess, is why freeing any active request is a bad idea.
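The “cleanup list” workaround Quincey describes can be sketched as follows (illustrative names such as `cleanup_entry` and `cleanup_poll` are not from HDF5; error checking omitted): each deferred request is polled with MPI_Test, and its communicator is freed once the nonblocking barrier has completed locally.

```c
/* Sketch: a linked list of (request, communicator) pairs that could
 * not be disposed of at file-close time.  cleanup_poll() is called
 * periodically on each rank; entries are freed once their barrier
 * completes locally. */
#include <mpi.h>
#include <stdlib.h>

struct cleanup_entry {
    MPI_Request          req;    /* active MPI_Ibarrier request        */
    MPI_Comm             comm;   /* file's communicator, awaiting free */
    struct cleanup_entry *next;
};

void cleanup_poll(struct cleanup_entry **list)
{
    struct cleanup_entry **p = list;
    while (*p) {
        int done = 0;
        MPI_Test(&(*p)->req, &done, MPI_STATUS_IGNORE);
        if (done) {                       /* barrier completed locally */
            struct cleanup_entry *e = *p;
            MPI_Comm_free(&e->comm);      /* now safe to free the comm */
            *p = e->next;                 /* unlink and dispose        */
            free(e);
        } else {
            p = &(*p)->next;              /* keep polling this entry   */
        }
    }
}
```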
>>>>> MPI is required to differentiate completion of individual operations (so it can implement MPI_WAIT), but that means something must process completion at some point for each individual operation. In RMA, that responsibility can be discharged earlier than in other parts of the MPI interface, but the real question is “why should MPI offer to take on this responsibility in the first place?”
>>>>>
>>>>> Thanks, that helps (me at least).
>>>>>
>>>>> Cheers,
>>>>> Dan.
>>>>>
>>>>>> On 13 Aug 2020, at 14:43, Jim Dinan <james.di...@gmail.com> wrote:
>>>>>>
>>>>>> The two cases you mentioned would have the same behavior at an application level. However, there may be important differences in the implementation of each operation. For example, an MPI_Put operation may be configured to not generate a completion event, whereas an MPI_Rput would. The library may be relying on the user to make a call on the request to process the event and clean up resources. The implementation can take over this responsibility if the user cancels the request, but why should we ask implementers to take on this complexity and overhead?
>>>>>>
>>>>>> My $0.02 is that MPI_Cancel is subtle and complicated, and we should be very careful about where we allow it. I don't see the benefit to the programming model outweighing the complexity and overhead in the MPI runtime for the case of MPI_Rput.
>>>>>> I also don't know that we were careful enough in specifying the RMA memory model that a canceled request-based RMA operation will still have well-defined behavior. My understanding is that MPI_Cancel is required primarily for canceling receive requests to meet MPI's quiescent-shutdown requirement.
>>>>>>
>>>>>> ~Jim.
>>>>>>
>>>>>> On Thu, Aug 13, 2020 at 8:11 AM HOLMES Daniel via mpi-forum <mpi-forum@lists.mpi-forum.org> wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> To increase my own understanding of RMA: what is the difference (if any) between a request-based RMA operation where the request is freed without being completed and before the epoch is closed, and a “normal” RMA operation?
>>>>>>
>>>>>> MPI_LOCK()            ! or any other “open epoch at origin” procedure call
>>>>>> doUserWorkBefore()
>>>>>> MPI_RPUT(&req)
>>>>>> MPI_REQUEST_FREE(&req)
>>>>>> doUserWorkAfter()
>>>>>> MPI_UNLOCK()          ! or the matching “close epoch at origin” procedure call
>>>>>>
>>>>>> vs:
>>>>>>
>>>>>> MPI_LOCK()            ! or any other “open epoch at origin” procedure call
>>>>>> doUserWorkBefore()
>>>>>> MPI_PUT()
>>>>>> doUserWorkAfter()
>>>>>> MPI_UNLOCK()          ! or the matching “close epoch at origin” procedure call
>>>>>>
>>>>>> Is this a source-to-source translation that is always safe in either direction?
>>>>>>
>>>>>> In RMA, in contrast to the rest of MPI, there are two opportunities for MPI to “block” and do non-local work to complete an RMA operation: 1) during MPI_WAIT for the request (if any; the user may not be given a request, may choose to free the request without calling MPI_WAIT, or might call nonblocking MPI_TEST), and 2) during the close-epoch procedure, which is always permitted to be sufficiently non-local to guarantee that the RMA operation is complete and its freeing stage has been done.
>>>>>> It seems that a request-based RMA operation becomes identical to a “normal” RMA operation if the user calls MPI_REQUEST_FREE on the request. This is like “freeing” the request of a nonblocking point-to-point operation, but without the guarantee of a later synchronisation procedure that can actually complete the operation and actually do its freeing stage.
>>>>>>
>>>>>> In collectives, there is no “ensure all operations so far are now done” procedure call because there is no concept of epoch for collectives.
>>>>>> In point-to-point, there is no “ensure all operations so far are now done” procedure call because there is no concept of epoch for point-to-point.
>>>>>> In file operations, there is no “ensure all operations so far are now done” procedure call because there is no concept of epoch for file operations. (There is MPI_FILE_SYNC, but it is optional, so MPI cannot rely on it being called.)
>>>>>> In these cases, the only non-local procedure that is guaranteed to happen is MPI_FINALIZE; hence all outstanding non-local work needed by the “freed” operation might be delayed until that procedure is called.
>>>>>>
>>>>>> The issue with copying parameters is also moot, because all of them are passed by value (implicitly copied) or are data buffers covered by the “conflicting accesses” RMA rules.
>>>>>>
>>>>>> Thus, it seems to me that RMA is a very special case: it could support different semantics, but that does not provide a good basis for claiming that the rest of the MPI Standard can support those different semantics, unless we introduce an epoch concept into the rest of the MPI Standard.
>>>>>> This is not unreasonable: the notifications in GASPI, for example, guarantee completion not just of the operation they are attached to but of *all* operations issued in the “queue” they represent since the last notification. Their queue concept serves the purpose of an epoch. I’m sure there are other examples in other APIs. It seems to me likely that the proposal for MPI_PSYNC for partitioned communication operations is moving in the direction of an epoch, although limited to remote completion of all the partitions in a single operation, which accidentally guarantees that the operation can be freed locally using a local procedure.
>>>>>>
>>>>>> Cheers,
>>>>>> Dan.
>>>>>>
>>>>>>> On 13 Aug 2020, at 01:40, Skjellum, Anthony via mpi-forum <mpi-forum@lists.mpi-forum.org> wrote:
>>>>>>>
>>>>>>> FYI, one argument (also used to force us to add restrictions requiring MPI persistent collective initialization to be blocking): MPI_Request_free on an NBC poses a problem for the cases where array arguments are passed (e.g., Alltoallv/w). It will not be knowable to the application whether the vectors are still in use by MPI after the free on an active request. We do *not* mandate that the MPI implementation copy such arrays currently, so they are effectively "held as unfreeable" by the MPI implementation till MPI_Finalize.
>>>>>>> The user cannot deallocate them in a correct program till after MPI_Finalize.
>>>>>>>
>>>>>>> Another effect of releasing an active NBC request, IMHO, is that you don't know when send buffers or receive buffers are free to be deallocated... since you don't know when the transfer is complete OR when the buffers are no longer used by MPI (till after MPI_Finalize).
>>>>>>>
>>>>>>> Tony
>>>>>>>
>>>>>>> Anthony Skjellum, PhD
>>>>>>> Professor of Computer Science and Chair of Excellence
>>>>>>> Director, SimCenter
>>>>>>> University of Tennessee at Chattanooga (UTC)
>>>>>>> tony-skjel...@utc.edu [or skjel...@gmail.com]
>>>>>>> cell: 205-807-4968
>>>>>>>
>>>>>>> From: mpi-forum <mpi-forum-boun...@lists.mpi-forum.org> on behalf of Jeff Hammond via mpi-forum <mpi-forum@lists.mpi-forum.org>
>>>>>>> Sent: Saturday, August 8, 2020 12:07 PM
>>>>>>> To: Main MPI Forum mailing list <mpi-forum@lists.mpi-forum.org>
>>>>>>> Cc: Jeff Hammond <jeff.scie...@gmail.com>
>>>>>>> Subject: Re: [Mpi-forum] MPI_Request_free restrictions
>>>>>>>
>>>>>>> We should fix the RMA chapter with an erratum. I care less about NBC but share your ignorance of why it was done that way.
>>>>>>>
>>>>>>> Sent from my iPhone
>>>>>>>
>>>>>>>> On Aug 8, 2020, at 6:51 AM, Balaji, Pavan via mpi-forum <mpi-forum@lists.mpi-forum.org> wrote:
>>>>>>>>
>>>>>>>> Folks,
>>>>>>>>
>>>>>>>> Does someone remember why we disallowed users from calling MPI_Request_free on nonblocking collective requests?
>>>>>>>> I remember the reasoning for not allowing cancel (i.e., the operation might have completed on some processes, but not all), but not for Request_free. AFAICT, allowing the users to free the request doesn’t make any difference to the MPI library. The MPI library would simply maintain its own refcount on the request and continue forward till the operation completes. One of our users would like to free NBC requests so they don’t have to wait for the operation to complete in some situations.
>>>>>>>>
>>>>>>>> Unfortunately, when I added the Rput/Rget operations in the RMA chapter, I copy-pasted that text into RMA as well without thinking too hard about it. My bad! Either the RMA committee missed it too, or they thought of a reason that I can’t think of now.
>>>>>>>>
>>>>>>>> Can someone clarify or remind me what the reason was?
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> — Pavan
>>>>>>>>
>>>>>>>> MPI-3.1 standard, page 197, lines 26-27:
>>>>>>>>
>>>>>>>> “It is erroneous to call MPI_REQUEST_FREE or MPI_CANCEL for a request associated with a nonblocking collective operation.”
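The refcount scheme Pavan describes could look something like this inside an implementation. This is a purely illustrative sketch, not any real MPI library's internals: the names `nbc_request`, `nbc_release`, and so on are invented for the example. MPI_Request_free on an active NBC request would just drop the user's reference; the progress engine keeps its own reference until the operation completes.

```c
/* Illustrative sketch of per-request reference counting: the request
 * object is deallocated only when both the user and the progress
 * engine have released it, so a user-side free of an active request
 * is harmless. */
#include <stdlib.h>

typedef struct nbc_request {
    int refcount;   /* one reference for the user, one for the library */
    int complete;   /* set by the progress engine on completion        */
} nbc_request;

static void nbc_release(nbc_request *r)
{
    if (--r->refcount == 0)
        free(r);    /* last reference gone: safe to deallocate */
}

/* What MPI_Request_free would do: drop only the user's reference. */
void user_request_free(nbc_request *r)
{
    nbc_release(r);
}

/* What the progress engine would do when the collective completes:
 * record completion, then drop the library's reference. */
void progress_on_complete(nbc_request *r)
{
    r->complete = 1;
    nbc_release(r);
}
```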
_______________________________________________
mpi-forum mailing list
mpi-forum@lists.mpi-forum.org
https://lists.mpi-forum.org/mailman/listinfo/mpi-forum