Hi Dan,
	I agree with you about MPI barriers, but that’s why I said it was a simplified pseudocode. :-)  We do have more mechanisms in place for handling the “fence-ness” of the operation, but barriers are a component and I’d like to move to a nonblocking version when possible.
Having more, hopefully all, of the file operations get equivalent nonblocking versions would be _very_ nice, and I could simplify our internal code more if a nonblocking MPI_FILE_SYNC was available. A nonblocking version of MPI_FILE_SET_SIZE would also be high on my list. Yes, I grok the behavior of MPI_FILE_CLOSE, but don’t want to add a barrier on top of it. :-)

One new question: Am I allowed to call MPI_COMM_FREE while I have an uncompleted request for a nonblocking collective operation (like MPI_IBARRIER) on the communicator? Will MPI_COMM_FREE block for completion of the NBC op?

Thanks!
	Quincey

> On Aug 15, 2020, at 6:07 AM, HOLMES Daniel <d.hol...@epcc.ed.ac.uk> wrote:
>
> Hi Quincey,
>
> The MPI barrier operation (whether blocking, nonblocking, or persistent) does not guarantee “memory fence” semantics (either for the content of memory or the content of files).
>
> Perhaps you are looking for MPI_FILE_SYNC?
>
> “If other processes have made updates to the storage device, then all such updates become visible to subsequent reads of fh by the calling process.” §13.6.1
>
> “MPI_FILE_SYNC is a collective operation.” §13.6.1
>
> Used correctly (the user must locally complete their I/O operations before calling it), this does provide a “fence”-like guarantee *for the file*, which is what your code looks like it is attempting. That is, all writes to the file that were initiated remotely (and locally completed at the remote process) before the matching remote call to MPI_FILE_SYNC are guaranteed to be visible in the file to subsequent locally issued MPI read operations, once the local call to MPI_FILE_SYNC completes locally.
>
> There is currently no nonblocking or persistent expression of this MPI procedure - watch this space: this is on the to-do list for MPI-Next.
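[The sync-barrier-sync “fence” recipe Dan describes might be sketched in C as follows. This is illustrative only, not code from the thread: `fh`, `comm`, the offsets, buffers, and `count` are placeholder names, and error checking is omitted.]

```c
/* Sketch of the MPI_FILE_SYNC "fence" pattern described above.
 * Assumes fh was opened collectively on comm. All parameter names are
 * placeholders; error checking omitted for brevity. */
#include <mpi.h>

void file_fence_sketch(MPI_File fh, MPI_Comm comm,
                       MPI_Offset my_offset, MPI_Offset peer_offset,
                       char *wbuf, char *rbuf, int count)
{
    /* 1) Locally complete the write (a blocking call completes it locally). */
    MPI_File_write_at(fh, my_offset, wbuf, count, MPI_BYTE, MPI_STATUS_IGNORE);

    /* 2) Flush this process's updates to the storage device (collective). */
    MPI_File_sync(fh);

    /* 3) Order everyone's sync before anyone's subsequent reads. */
    MPI_Barrier(comm);

    /* 4) Make the other processes' updates visible to this process. */
    MPI_File_sync(fh);

    /* 5) Reads now see all writes that preceded the remote syncs. */
    MPI_File_read_at(fh, peer_offset, rbuf, count, MPI_BYTE, MPI_STATUS_IGNORE);
}
```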
>
> As Jim points out, the performance problem you note is most likely due to the implicit MPI_FILE_SYNC-like synchronisation done internally by MPI during the MPI_FILE_CLOSE procedure call. All enqueued file operations targeting the file will be flushed to the file during MPI_FILE_CLOSE. If file operations are not flushed to the file concurrently with the application stuff or the MPI communication operations, then they will still be enqueued when MPI_FILE_CLOSE is called.
>
> Cheers,
> Dan.
> —
> Dr Daniel Holmes PhD
> Architect (HPC Research)
> d.hol...@epcc.ed.ac.uk
> Phone: +44 (0) 131 651 3465
> Mobile: +44 (0) 7940 524 088
> Address: Room 2.09, Bayes Centre, 47 Potterrow, Central Area, Edinburgh, EH8 9BT
> —
> The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
> —
>
>> On 14 Aug 2020, at 17:32, Quincey Koziol via mpi-forum <mpi-forum@lists.mpi-forum.org> wrote:
>>
>> Hi Dan,
>> I believe that Pavan was referring to my conversation with him about MPI_Request_free. Here’s my situation: I’d like to use MPI_Ibarrier as a form of “memory fence” between some of the metadata reads and writes in HDF5.
>> Here’s some [very] simplified pseudocode for what I’d like to do:
>>
>> ===============================
>>
>> <open HDF5 file>   // sets up a communicator for internal HDF5 communication about this file
>>
>> do {
>>     MPI_Ibarrier(<file’s communicator>, &request);
>>
>>     <application stuff>
>>
>>     // HDF5 operation:
>>     if (<operation is read or write>) {
>>         MPI_Wait(&request, MPI_STATUS_IGNORE);
>>         <perform read / write>
>>     }
>>     else {   // operation is a file close
>>         MPI_Request_free(&request);
>>         MPI_File_close(…);
>>         MPI_Comm_free(<file’s communicator>);
>>     }
>> } while (<file is open>);
>>
>> ===============================
>>
>> What I am really trying to avoid is calling MPI_Wait at file close, since it is semantically unnecessary and only increases the latency from the application’s perspective. If I can’t call MPI_Request_free on a nonblocking collective operation’s request (and it looks like I can’t, right now), I will have to put the request and the file’s communicator into a “cleanup” list that is polled periodically [on each rank] with MPI_Test and disposed of when the nonblocking barrier completes locally.
>>
>> So, I’d really like to be able to call MPI_Request_free on the nonblocking barrier’s request.
>>
>> Thoughts?
>>
>> Quincey
>>
>>
>>> On Aug 13, 2020, at 9:07 AM, HOLMES Daniel via mpi-forum <mpi-forum@lists.mpi-forum.org> wrote:
>>>
>>> Hi Jim,
>>>
>>> To be clear, I think that MPI_CANCEL is evil and should be removed from the MPI Standard entirely at the earliest convenience.
>>>
>>> I am certainly not arguing that it be permitted for more MPI operations.
>>>
>>> I thought the discussion was focused on MPI_REQUEST_FREE and whether or not it can/should be used on an active request.
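[The “cleanup list” fallback Quincey describes could look roughly like this. A sketch only: the struct and function names are hypothetical, not HDF5 code.]

```c
/* Sketch of the "cleanup list" workaround: park the unfinished
 * MPI_Ibarrier request and the file's communicator, then poll with
 * MPI_Test from other entry points and dispose of each entry once the
 * barrier completes locally. All names are hypothetical. */
#include <stdlib.h>
#include <mpi.h>

typedef struct cleanup_entry {
    MPI_Request           req;   /* outstanding MPI_Ibarrier request */
    MPI_Comm              comm;  /* file's communicator, freed on completion */
    struct cleanup_entry *next;
} cleanup_entry;

/* Called periodically on each rank, e.g. from other library entry points. */
void poll_cleanup_list(cleanup_entry **head)
{
    cleanup_entry **p = head;
    while (*p) {
        int done = 0;
        MPI_Test(&(*p)->req, &done, MPI_STATUS_IGNORE);
        if (done) {                    /* barrier completed locally */
            cleanup_entry *e = *p;
            MPI_Comm_free(&e->comm);   /* now safe to free the communicator */
            *p = e->next;
            free(e);
        } else {
            p = &(*p)->next;
        }
    }
}
```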
>>>
>>> If a particular MPI implementation does not keep a reference to the request between MPI_RPUT and MPI_REQUEST_FREE, but needs that reference to process the completion event, then that MPI implementation would be required to keep a reference to the request from MPI_REQUEST_FREE until that important task had been done, perhaps until the close-epoch call. This requires no new memory because the user is giving up their reference to the request, so MPI can safely use the request it is passed in MPI_REQUEST_FREE without copying it. As you say, MPI takes over the responsibility for processing the completion event.
>>>
>>> Your question about why the implementation should be required to take on this complexity is a good one. That, I guess, is why freeing any active request is a bad idea. MPI is required to differentiate completion of individual operations (so it can implement MPI_WAIT), but that means something must process completion at some point for each individual operation. In RMA, that responsibility can be discharged earlier than in other parts of the MPI interface, but the real question is “why should MPI offer to take on this responsibility in the first place?”
>>>
>>> Thanks, that helps (me at least).
>>>
>>> Cheers,
>>> Dan.
>>> —
>>> Dr Daniel Holmes PhD
>>> Architect (HPC Research)
>>> d.hol...@epcc.ed.ac.uk
>>> Phone: +44 (0) 131 651 3465
>>> Mobile: +44 (0) 7940 524 088
>>> Address: Room 2.09, Bayes Centre, 47 Potterrow, Central Area, Edinburgh, EH8 9BT
>>> —
>>> The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
>>> —
>>>
>>>> On 13 Aug 2020, at 14:43, Jim Dinan <james.di...@gmail.com> wrote:
>>>>
>>>> The two cases you mentioned would have the same behavior at an application level.
>>>> However, there may be important differences in the implementation of each operation. For example, an MPI_Put operation may be configured to not generate a completion event, whereas an MPI_Rput would. The library may be relying on the user to make a call on the request to process the event and clean up resources. The implementation can take over this responsibility if the user cancels the request, but why should we ask implementers to take on this complexity and overhead?
>>>>
>>>> My $0.02 is that MPI_Cancel is subtle and complicated, and we should be very careful about where we allow it. I don't see the benefit to the programming model outweighing the complexity and overhead in the MPI runtime for the case of MPI_Rput. I also don't know that we were careful enough in specifying the RMA memory model that a canceled request-based RMA operation will still have well-defined behavior. My understanding is that MPI_Cancel is required primarily for canceling receive requests to meet MPI's quiescent shutdown requirement.
>>>>
>>>> ~Jim.
>>>>
>>>> On Thu, Aug 13, 2020 at 8:11 AM HOLMES Daniel via mpi-forum <mpi-forum@lists.mpi-forum.org> wrote:
>>>> Hi all,
>>>>
>>>> To increase my own understanding of RMA, what is the difference (if any) between a request-based RMA operation where the request is freed without being completed and before the epoch is closed and a “normal” RMA operation?
>>>>
>>>> MPI_LOCK()       ! or any other "open epoch at origin" procedure call
>>>> doUserWorkBefore()
>>>> MPI_RPUT(&req)
>>>> MPI_REQUEST_FREE(&req)
>>>> doUserWorkAfter()
>>>> MPI_UNLOCK()     ! or the matching "close epoch at origin" procedure call
>>>>
>>>> vs:
>>>>
>>>> MPI_LOCK()       ! or any other "open epoch at origin" procedure call
>>>> doUserWorkBefore()
>>>> MPI_PUT()
>>>> doUserWorkAfter()
>>>> MPI_UNLOCK()     ! or the matching "close epoch at origin" procedure call
>>>>
>>>> Is this a source-to-source translation that is always safe in either direction?
>>>>
>>>> In RMA, in contrast to the rest of MPI, there are two opportunities for MPI to “block” and do non-local work to complete an RMA operation: 1) during MPI_WAIT for the request (if any - the user may not be given a request, or the user may choose to free the request without calling MPI_WAIT, or the user might call nonblocking MPI_TEST); and 2) during the close-epoch procedure, which is always permitted to be sufficiently non-local to guarantee that the RMA operation is complete and its freeing stage has been done. It seems that a request-based RMA operation becomes identical to a “normal” RMA operation if the user calls MPI_REQUEST_FREE on the request. This is like “freeing” the request from a nonblocking point-to-point operation, but without the guarantee of a later synchronisation procedure that can actually complete the operation and actually do the freeing stage of the operation.
>>>>
>>>> In collectives, there is no “ensure all operations so far are now done” procedure call because there is no concept of epoch for collectives.
>>>> In point-to-point, there is no “ensure all operations so far are now done” procedure call because there is no concept of epoch for point-to-point.
>>>> In file operations, there is no “ensure all operations so far are now done” procedure call because there is no concept of epoch for file operations. (There is MPI_FILE_SYNC, but it is optional, so MPI cannot rely on it being called.)
>>>> In these cases, the only non-local procedure that is guaranteed to happen is MPI_FINALIZE, hence all outstanding non-local work needed by the “freed” operation might be delayed until that procedure is called.
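[The point-to-point situation Dan describes (no epoch, so no later call that completes a freed operation) is the classic hazard with MPI_Request_free on an active send. A sketch, not from the thread; `buf`, `dest`, `tag`, etc. are placeholders.]

```c
/* Sketch: freeing an active point-to-point send request is permitted,
 * but leaves the application with no MPI call that says when buf may be
 * reused or deallocated. All names here are placeholders. */
#include <mpi.h>

void freed_send_sketch(void *buf, int count, int dest, int tag, MPI_Comm comm)
{
    MPI_Request req;
    int ack;

    MPI_Isend(buf, count, MPI_BYTE, dest, tag, comm, &req);
    MPI_Request_free(&req);   /* legal for point-to-point sends... */

    /* ...but buf must remain untouched until the application learns by
     * some other means that the transfer is done, e.g. a reply message: */
    MPI_Recv(&ack, 1, MPI_INT, dest, tag + 1, comm, MPI_STATUS_IGNORE);

    /* Only now is it safe to reuse or deallocate buf. */
}
```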
>>>>
>>>> The issue with copying parameters is also moot, because all of them are passed by value (implicitly copied) or are data buffers covered by the “conflicting accesses” RMA rules.
>>>>
>>>> Thus, it seems to me that RMA is a very special case - it could support different semantics, but that does not provide a good basis for claiming that the rest of the MPI Standard can support those different semantics - unless we introduce an epoch concept into the rest of the MPI Standard. This is not unreasonable: the notifications in GASPI, for example, guarantee completion of not just the operation they are attached to but *all* operations issued in the “queue” they represent since the last notification. Their queue concept serves the purpose of an epoch. I’m sure there are other examples in other APIs. It seems to me likely that the proposal for MPI_PSYNC for partitioned communication operations is moving in the direction of an epoch, although limited to remote completion of all the partitions in a single operation, which incidentally guarantees that the operation can be freed locally using a local procedure.
>>>>
>>>> Cheers,
>>>> Dan.
>>>> —
>>>> Dr Daniel Holmes PhD
>>>> Architect (HPC Research)
>>>> d.hol...@epcc.ed.ac.uk
>>>> Phone: +44 (0) 131 651 3465
>>>> Mobile: +44 (0) 7940 524 088
>>>> Address: Room 2.09, Bayes Centre, 47 Potterrow, Central Area, Edinburgh, EH8 9BT
>>>> —
>>>> The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
>>>> —
>>>>
>>>>> On 13 Aug 2020, at 01:40, Skjellum, Anthony via mpi-forum <mpi-forum@lists.mpi-forum.org> wrote:
>>>>>
>>>>> FYI, one argument (also used to force us to add restrictions on MPI persistent collective initialization to be blocking)...
>>>>> The MPI_Request_free on an NBC poses a problem for the cases where array arguments are passed (e.g., Alltoallv/w)... It will not be knowable to the application whether the vectors are still in use by MPI after the free on an active request. We do *not* currently mandate that the MPI implementation copy such arrays, so they are effectively “held as unfreeable” by the MPI implementation till MPI_Finalize. The user cannot deallocate them in a correct program till after MPI_Finalize.
>>>>>
>>>>> Another effect for NBC of releasing an active request, IMHO, is that you don't know when send buffers or receive buffers are free to be deallocated... since you don't know when the transfer is complete OR when the buffers are no longer used by MPI (till after MPI_Finalize).
>>>>>
>>>>> Tony
>>>>>
>>>>> Anthony Skjellum, PhD
>>>>> Professor of Computer Science and Chair of Excellence
>>>>> Director, SimCenter
>>>>> University of Tennessee at Chattanooga (UTC)
>>>>> tony-skjel...@utc.edu [or skjel...@gmail.com]
>>>>> cell: 205-807-4968
>>>>>
>>>>> From: mpi-forum <mpi-forum-boun...@lists.mpi-forum.org> on behalf of Jeff Hammond via mpi-forum <mpi-forum@lists.mpi-forum.org>
>>>>> Sent: Saturday, August 8, 2020 12:07 PM
>>>>> To: Main MPI Forum mailing list <mpi-forum@lists.mpi-forum.org>
>>>>> Cc: Jeff Hammond <jeff.scie...@gmail.com>
>>>>> Subject: Re: [Mpi-forum] MPI_Request_free restrictions
>>>>>
>>>>> We should fix the RMA chapter with an erratum. I care less about NBC but share your ignorance of why it was done that way.
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>>> On Aug 8, 2020, at 6:51 AM, Balaji, Pavan via mpi-forum <mpi-forum@lists.mpi-forum.org> wrote:
>>>>>>
>>>>>> Folks,
>>>>>>
>>>>>> Does someone remember why we disallowed users from calling MPI_Request_free on nonblocking collective requests? I remember the reasoning for not allowing cancel (i.e., the operation might have completed on some processes, but not all), but not for Request_free. AFAICT, allowing the users to free the request doesn’t make any difference to the MPI library. The MPI library would simply maintain its own refcount on the request and continue forward till the operation completes. One of our users would like to free NBC requests so they don’t have to wait for the operation to complete in some situations.
>>>>>>
>>>>>> Unfortunately, when I added the Rput/Rget operations in the RMA chapter, I copy-pasted that text into RMA as well without thinking too hard about it. My bad! Either the RMA committee missed it too, or they thought of a reason that I can’t think of now.
>>>>>>
>>>>>> Can someone clarify or remind me what the reason was?
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> — Pavan
>>>>>>
>>>>>> MPI-3.1 standard, page 197, lines 26-27:
>>>>>>
>>>>>> “It is erroneous to call MPI_REQUEST_FREE or MPI_CANCEL for a request associated with a nonblocking collective operation.”
_______________________________________________
mpi-forum mailing list
mpi-forum@lists.mpi-forum.org
https://lists.mpi-forum.org/mailman/listinfo/mpi-forum