Hi Dan,
	I agree with you about MPI barriers, but that’s why I said it was a simplified pseudocode. :-)  We do have more mechanisms in place for handling the “fence-ness” of the operation, but barriers are a component and I’d like to move to a nonblocking version when possible.
Having more, hopefully all, of the file operations get equivalent nonblocking versions would be _very_ nice, and I could simplify our internal code more if a nonblocking MPI_FILE_SYNC was available. A nonblocking version of MPI_FILE_SET_SIZE would also be high on my list. Yes, I grok the behavior of MPI_FILE_CLOSE, but don’t want to add a barrier on top of it. :-)

One new question: Am I allowed to call MPI_COMM_FREE while I have an uncompleted request for a nonblocking collective operation (like MPI_IBARRIER) on the communicator? Will MPI_COMM_FREE block for completion of the NBC op?

Thanks!
	Quincey

> On Aug 15, 2020, at 6:07 AM, HOLMES Daniel <d.hol...@epcc.ed.ac.uk> wrote:
>
> Hi Quincey,
>
> The MPI barrier operation (whether blocking, nonblocking, or persistent) does not guarantee “memory fence” semantics (either for the content of memory or the content of files).
>
> Perhaps you are looking for MPI_FILE_SYNC?
>
> “If other processes have made updates to the storage device, then all such updates become visible to subsequent reads of fh by the calling process.” §13.6.1
>
> “MPI_FILE_SYNC is a collective operation.” §13.6.1
>
> Used correctly (the user must locally complete their I/O operations before calling it), this does provide a “fence”-like guarantee *for the file*, which is what your code looks like it is attempting. That is, all writes to the file that were initiated remotely (and locally completed at the remote process) before the matching remote call to MPI_FILE_SYNC are guaranteed to be visible in the file to subsequent locally issued MPI read operations, once the local call to MPI_FILE_SYNC completes locally.
>
> There is currently no nonblocking or persistent expression of this MPI procedure - watch this space: this is on the to-do list for MPI-Next.
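[The sync-barrier-sync “fence” recipe Dan describes might be sketched in C as follows. This is illustrative only, not code from the thread: `fh`, `comm`, the offsets, buffers, and `count` are placeholder names, and error checking is omitted.]

```c
/* Sketch of the MPI_FILE_SYNC "fence" pattern described above.
 * Assumes fh was opened collectively on comm. All parameter names are
 * placeholders; error checking omitted for brevity. */
#include <mpi.h>

void file_fence_sketch(MPI_File fh, MPI_Comm comm,
                       MPI_Offset my_offset, MPI_Offset peer_offset,
                       char *wbuf, char *rbuf, int count)
{
    /* 1) Locally complete the write (a blocking call completes it locally). */
    MPI_File_write_at(fh, my_offset, wbuf, count, MPI_BYTE, MPI_STATUS_IGNORE);

    /* 2) Flush this process's updates to the storage device (collective). */
    MPI_File_sync(fh);

    /* 3) Order everyone's sync before anyone's subsequent reads. */
    MPI_Barrier(comm);

    /* 4) Make the other processes' updates visible to this process. */
    MPI_File_sync(fh);

    /* 5) Reads now see all writes that preceded the remote syncs. */
    MPI_File_read_at(fh, peer_offset, rbuf, count, MPI_BYTE, MPI_STATUS_IGNORE);
}
```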
>
> As Jim points out, the performance problem you note is most likely due to the implicit MPI_FILE_SYNC-like synchronisation done internally by MPI during the MPI_FILE_CLOSE procedure call. All enqueued file operations targeting the file will be flushed to the file during MPI_FILE_CLOSE. If file operations are not flushed to the file concurrently with the application stuff or the MPI communication operations, then they will still be enqueued when MPI_FILE_CLOSE is called.
>
> Cheers,
> Dan.
> —
> Dr Daniel Holmes PhD
> Architect (HPC Research)
> d.hol...@epcc.ed.ac.uk
> Phone: +44 (0) 131 651 3465
> Mobile: +44 (0) 7940 524 088
> Address: Room 2.09, Bayes Centre, 47 Potterrow, Central Area, Edinburgh, EH8 9BT
> —
> The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
> —
>
>> On 14 Aug 2020, at 17:32, Quincey Koziol via mpi-forum <mpi-forum@lists.mpi-forum.org> wrote:
>>
>> Hi Dan,
>> I believe that Pavan was referring to my conversation with him about MPI_Request_free. Here’s my situation: I’d like to use MPI_Ibarrier as a form of “memory fence” between some of the metadata reads and writes in HDF5.
>> Here’s some [very] simplified pseudocode for what I’d like to do:
>>
>> ===============================
>>
>> <open HDF5 file>   // sets up a communicator for internal HDF5 communication about this file
>>
>> do {
>>     MPI_Ibarrier(<file’s communicator>, &request);
>>
>>     <application stuff>
>>
>>     // HDF5 operation:
>>     if (<operation is read or write>) {
>>         MPI_Wait(&request, MPI_STATUS_IGNORE);
>>         <perform read / write>
>>     }
>>     else {   // operation is a file close
>>         MPI_Request_free(&request);
>>         MPI_File_close(…);
>>         MPI_Comm_free(<file’s communicator>);
>>     }
>> } while (<file is open>);
>>
>> ===============================
>>
>> What I am really trying to avoid is calling MPI_Wait at file close, since it is semantically unnecessary and only increases the latency from the application’s perspective. If I can’t call MPI_Request_free on a nonblocking collective operation’s request (and it looks like I can’t, right now), I will have to put the request and the file’s communicator into a “cleanup” list that is polled periodically [on each rank] with MPI_Test and disposed of when the nonblocking barrier completes locally.
>>
>> So, I’d really like to be able to call MPI_Request_free on the nonblocking barrier’s request.
>>
>> Thoughts?
>>
>> Quincey
>>
>>
>>> On Aug 13, 2020, at 9:07 AM, HOLMES Daniel via mpi-forum <mpi-forum@lists.mpi-forum.org> wrote:
>>>
>>> Hi Jim,
>>>
>>> To be clear, I think that MPI_CANCEL is evil and should be removed from the MPI Standard entirely at the earliest convenience.
>>>
>>> I am certainly not arguing that it be permitted for more MPI operations.
>>>
>>> I thought the discussion was focused on MPI_REQUEST_FREE and whether or not it can/should be used on an active request.
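[The “cleanup list” fallback Quincey describes could look roughly like this. A sketch only: the struct and function names are hypothetical, not HDF5 code.]

```c
/* Sketch of the "cleanup list" workaround: park the unfinished
 * MPI_Ibarrier request and the file's communicator, then poll with
 * MPI_Test from other entry points and dispose of each entry once the
 * barrier completes locally. All names are hypothetical. */
#include <stdlib.h>
#include <mpi.h>

typedef struct cleanup_entry {
    MPI_Request           req;   /* outstanding MPI_Ibarrier request */
    MPI_Comm              comm;  /* file's communicator, freed on completion */
    struct cleanup_entry *next;
} cleanup_entry;

/* Called periodically on each rank, e.g. from other library entry points. */
void poll_cleanup_list(cleanup_entry **head)
{
    cleanup_entry **p = head;
    while (*p) {
        int done = 0;
        MPI_Test(&(*p)->req, &done, MPI_STATUS_IGNORE);
        if (done) {                    /* barrier completed locally */
            cleanup_entry *e = *p;
            MPI_Comm_free(&e->comm);   /* now safe to free the communicator */
            *p = e->next;
            free(e);
        } else {
            p = &(*p)->next;
        }
    }
}
```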
>>>
>>> If a particular MPI implementation does not keep a reference to the request between MPI_RPUT and MPI_REQUEST_FREE, but needs that reference to process the completion event, then that MPI implementation would be required to keep a reference to the request from MPI_REQUEST_FREE until that important task had been done, perhaps until the close-epoch call. This requires no new memory because the user is giving up their reference to the request, so MPI can safely use the request it is passed in MPI_REQUEST_FREE without copying it. As you say, MPI takes over the responsibility for processing the completion event.
>>>
>>> Your question about why the implementation should be required to take on this complexity is a good one. That, I guess, is why freeing any active request is a bad idea. MPI is required to differentiate completion of individual operations (so it can implement MPI_WAIT), but that means something must process completion at some point for each individual operation. In RMA, that responsibility can be discharged earlier than in other parts of the MPI interface, but the real question is “why should MPI offer to take on this responsibility in the first place?”
>>>
>>> Thanks, that helps (me at least).
>>>
>>> Cheers,
>>> Dan.
>>> —
>>> Dr Daniel Holmes PhD
>>> Architect (HPC Research)
>>> d.hol...@epcc.ed.ac.uk
>>> Phone: +44 (0) 131 651 3465
>>> Mobile: +44 (0) 7940 524 088
>>> Address: Room 2.09, Bayes Centre, 47 Potterrow, Central Area, Edinburgh, EH8 9BT
>>> —
>>> The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
>>> —
>>>
>>>> On 13 Aug 2020, at 14:43, Jim Dinan <james.di...@gmail.com> wrote:
>>>>
>>>> The two cases you mentioned would have the same behavior at an application level.
>>>> However, there may be important differences in the implementation of each operation. For example, an MPI_Put operation may be configured to not generate a completion event, whereas an MPI_Rput would. The library may be relying on the user to make a call on the request to process the event and clean up resources. The implementation can take over this responsibility if the user cancels the request, but why should we ask implementers to take on this complexity and overhead?
>>>>
>>>> My $0.02 is that MPI_Cancel is subtle and complicated, and we should be very careful about where we allow it. I don't see the benefit to the programming model outweighing the complexity and overhead in the MPI runtime for the case of MPI_Rput. I also don't know that we were careful enough in specifying the RMA memory model that a canceled request-based RMA operation will still have well-defined behavior. My understanding is that MPI_Cancel is required primarily for canceling receive requests to meet MPI's quiescent shutdown requirement.
>>>>
>>>> ~Jim.
>>>>
>>>> On Thu, Aug 13, 2020 at 8:11 AM HOLMES Daniel via mpi-forum <mpi-forum@lists.mpi-forum.org> wrote:
>>>> Hi all,
>>>>
>>>> To increase my own understanding of RMA, what is the difference (if any) between a request-based RMA operation where the request is freed without being completed and before the epoch is closed and a “normal” RMA operation?
>>>>
>>>> MPI_LOCK()       ! or any other "open epoch at origin" procedure call
>>>> doUserWorkBefore()
>>>> MPI_RPUT(&req)
>>>> MPI_REQUEST_FREE(&req)
>>>> doUserWorkAfter()
>>>> MPI_UNLOCK()     ! or the matching "close epoch at origin" procedure call
>>>>
>>>> vs:
>>>>
>>>> MPI_LOCK()       ! or any other "open epoch at origin" procedure call
>>>> doUserWorkBefore()
>>>> MPI_PUT()
>>>> doUserWorkAfter()
>>>> MPI_UNLOCK()     ! or the matching "close epoch at origin" procedure call
>>>>
>>>> Is this a source-to-source translation that is always safe in either direction?
>>>>
>>>> In RMA, in contrast to the rest of MPI, there are two opportunities for MPI to “block” and do non-local work to complete an RMA operation: 1) during MPI_WAIT for the request (if any - the user may not be given a request, or the user may choose to free the request without calling MPI_WAIT, or the user might call nonblocking MPI_TEST); and 2) during the close-epoch procedure, which is always permitted to be sufficiently non-local to guarantee that the RMA operation is complete and its freeing stage has been done. It seems that a request-based RMA operation becomes identical to a “normal” RMA operation if the user calls MPI_REQUEST_FREE on the request. This is like “freeing” the request from a nonblocking point-to-point operation, but without the guarantee of a later synchronisation procedure that can actually complete the operation and actually do the freeing stage of the operation.
>>>>
>>>> In collectives, there is no “ensure all operations so far are now done” procedure call because there is no concept of epoch for collectives.
>>>> In point-to-point, there is no “ensure all operations so far are now done” procedure call because there is no concept of epoch for point-to-point.
>>>> In file operations, there is no “ensure all operations so far are now done” procedure call because there is no concept of epoch for file operations. (There is MPI_FILE_SYNC, but it is optional, so MPI cannot rely on it being called.)
>>>> In these cases, the only non-local procedure that is guaranteed to happen is MPI_FINALIZE, hence all outstanding non-local work needed by the “freed” operation might be delayed until that procedure is called.
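[The point-to-point situation Dan describes (no epoch, so no later call that completes a freed operation) is the classic hazard with MPI_Request_free on an active send. A sketch, not from the thread; `buf`, `dest`, `tag`, etc. are placeholders.]

```c
/* Sketch: freeing an active point-to-point send request is permitted,
 * but leaves the application with no MPI call that says when buf may be
 * reused or deallocated. All names here are placeholders. */
#include <mpi.h>

void freed_send_sketch(void *buf, int count, int dest, int tag, MPI_Comm comm)
{
    MPI_Request req;
    int ack;

    MPI_Isend(buf, count, MPI_BYTE, dest, tag, comm, &req);
    MPI_Request_free(&req);   /* legal for point-to-point sends... */

    /* ...but buf must remain untouched until the application learns by
     * some other means that the transfer is done, e.g. a reply message: */
    MPI_Recv(&ack, 1, MPI_INT, dest, tag + 1, comm, MPI_STATUS_IGNORE);

    /* Only now is it safe to reuse or deallocate buf. */
}
```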
>>>>
>>>> The issue with copying parameters is also moot, because all of them are passed by value (implicitly copied) or are data buffers covered by the “conflicting accesses” RMA rules.
>>>>
>>>> Thus, it seems to me that RMA is a very special case - it could support different semantics, but that does not provide a good basis for claiming that the rest of the MPI Standard can support those different semantics - unless we introduce an epoch concept into the rest of the MPI Standard. This is not unreasonable: the notifications in GASPI, for example, guarantee completion of not just the operation they are attached to but *all* operations issued in the “queue” they represent since the last notification. Their queue concept serves the purpose of an epoch. I’m sure there are other examples in other APIs. It seems to me likely that the proposal for MPI_PSYNC for partitioned communication operations is moving in the direction of an epoch, although limited to remote completion of all the partitions in a single operation, which incidentally guarantees that the operation can be freed locally using a local procedure.
>>>>
>>>> Cheers,
>>>> Dan.
>>>> —
>>>> Dr Daniel Holmes PhD
>>>> Architect (HPC Research)
>>>> d.hol...@epcc.ed.ac.uk
>>>> Phone: +44 (0) 131 651 3465
>>>> Mobile: +44 (0) 7940 524 088
>>>> Address: Room 2.09, Bayes Centre, 47 Potterrow, Central Area, Edinburgh, EH8 9BT
>>>> —
>>>> The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
>>>> —
>>>>
>>>>> On 13 Aug 2020, at 01:40, Skjellum, Anthony via mpi-forum <mpi-forum@lists.mpi-forum.org> wrote:
>>>>>
>>>>> FYI, one argument (also used to force us to add restrictions on MPI persistent collective initialization to be blocking)...
>>>>> The MPI_Request_free on an NBC poses a problem for the cases where array arguments are passed (e.g., Alltoallv/w)... It will not be knowable to the application whether the vectors are still in use by MPI after the free on an active request. We do *not* currently mandate that the MPI implementation copy such arrays, so they are effectively “held as unfreeable” by the MPI implementation till MPI_Finalize. The user cannot deallocate them in a correct program till after MPI_Finalize.
>>>>>
>>>>> Another effect for NBC of releasing an active request, IMHO, is that you don't know when send buffers or receive buffers are free to be deallocated... since you don't know when the transfer is complete OR when the buffers are no longer used by MPI (till after MPI_Finalize).
>>>>>
>>>>> Tony
>>>>>
>>>>> Anthony Skjellum, PhD
>>>>> Professor of Computer Science and Chair of Excellence
>>>>> Director, SimCenter
>>>>> University of Tennessee at Chattanooga (UTC)
>>>>> tony-skjel...@utc.edu [or skjel...@gmail.com]
>>>>> cell: 205-807-4968
>>>>>
>>>>> From: mpi-forum <mpi-forum-boun...@lists.mpi-forum.org> on behalf of Jeff Hammond via mpi-forum <mpi-forum@lists.mpi-forum.org>
>>>>> Sent: Saturday, August 8, 2020 12:07 PM
>>>>> To: Main MPI Forum mailing list <mpi-forum@lists.mpi-forum.org>
>>>>> Cc: Jeff Hammond <jeff.scie...@gmail.com>
>>>>> Subject: Re: [Mpi-forum] MPI_Request_free restrictions
>>>>>
>>>>> We should fix the RMA chapter with an erratum. I care less about NBC but share your ignorance of why it was done that way.
>>>>>
>>>>> Sent from my iPhone
>>>>>
>>>>>> On Aug 8, 2020, at 6:51 AM, Balaji, Pavan via mpi-forum <mpi-forum@lists.mpi-forum.org> wrote:
>>>>>>
>>>>>> Folks,
>>>>>>
>>>>>> Does someone remember why we disallowed users from calling MPI_Request_free on nonblocking collective requests? I remember the reasoning for not allowing cancel (i.e., the operation might have completed on some processes, but not all), but not for Request_free. AFAICT, allowing the users to free the request doesn’t make any difference to the MPI library. The MPI library would simply maintain its own refcount on the request and continue forward till the operation completes. One of our users would like to free NBC requests so they don’t have to wait for the operation to complete in some situations.
>>>>>>
>>>>>> Unfortunately, when I added the Rput/Rget operations in the RMA chapter, I copy-pasted that text into RMA as well without thinking too hard about it. My bad! Either the RMA committee missed it too, or they thought of a reason that I can’t think of now.
>>>>>>
>>>>>> Can someone clarify or remind me what the reason was?
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> — Pavan
>>>>>>
>>>>>> MPI-3.1 standard, page 197, lines 26-27:
>>>>>>
>>>>>> “It is erroneous to call MPI_REQUEST_FREE or MPI_CANCEL for a request associated with a nonblocking collective operation.”
_______________________________________________
mpi-forum mailing list
mpi-forum@lists.mpi-forum.org
https://lists.mpi-forum.org/mailman/listinfo/mpi-forum