Hi Quincey,

There is no MPI_COMM_IDISCONNECT currently, but there was an effort to add
one; you can see the discussion here:

https://github.com/mpi-forum/mpi-issues/issues/81
https://github.com/mpi-forum/mpi-issues/issues/78

====
Puri
________________________________
From: mpi-forum <mpi-forum-boun...@lists.mpi-forum.org> on behalf of Quincey 
Koziol via mpi-forum <mpi-forum@lists.mpi-forum.org>
Sent: Friday, August 21, 2020 10:19 AM
To: MPI Forum <mpi-forum@lists.mpi-forum.org>
Cc: Quincey Koziol <koz...@lbl.gov>
Subject: Re: [Mpi-forum] MPI_Request_free restrictions

Hi Dan,
Ah, very useful to know, thanks!  Is there a nonblocking version of
MPI_COMM_DISCONNECT? I've searched the web for MPI_COMM_IDISCONNECT and it
comes up empty, but that's not canonical. :-)

If not, can a capability like this be added to any “wish lists”?  Ideally,
calling something like MPI_COMM_IDISCONNECT and then waiting for that
operation's request to complete would mean that MPI_COMM_FREE is guaranteed to
be both nonblocking and local, as sketched below.  Thoughts?
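
For concreteness, a purely hypothetical sketch (MPI_Comm_idisconnect does not
exist in any MPI Standard; the name and signature here are invented for
illustration, and comm is assumed to be set up already):

MPI_Request req;
MPI_Comm_idisconnect(&comm, &req);  /* hypothetical nonblocking disconnect */
/* ... overlap application work here ... */
MPI_Wait(&req, MPI_STATUS_IGNORE);  /* all pending communication on comm is
                                       now complete */
MPI_Comm_free(&comm);               /* guaranteed local and nonblocking */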

Quincey


On Aug 21, 2020, at 10:01 AM, HOLMES Daniel <d.hol...@epcc.ed.ac.uk> wrote:

Hi Quincey,

Calling MPI_COMM_FREE when some requests representing nonblocking or persistent 
operations are still active is not prohibited by MPI and seems to work 
successfully in all the MPI libraries I’ve tested.

The normative description for MPI_COMM_FREE in the MPI Standard specifically 
calls out that it will only mark the communicator for freeing later and may 
return to the user before pending/ongoing communication is complete. It does 
not require that the completion procedure has been called for active operations.

We discussed in the Forum (as recently as the meeting this week) that this is a 
key difference between MPI_COMM_FREE and MPI_COMM_DISCONNECT - the latter 
states that the user is required to call the completion procedure(s) for all 
operations using a communicator before disconnecting it using 
MPI_COMM_DISCONNECT, which will wait for all pending communication to complete 
internally.
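
A minimal sketch of the contrast (assuming comm came from the dynamic-process
routines, since MPI_COMM_DISCONNECT requires that; error checks omitted):

MPI_Request req;
MPI_Ibarrier(comm, &req);

/* Legal with MPI_Comm_free: the communicator is only marked for
   deallocation, so the call may return before the barrier completes,
   and the request remains valid and completable. */
MPI_Comm_free(&comm);
MPI_Wait(&req, MPI_STATUS_IGNORE);

/* With MPI_Comm_disconnect the order must be reversed: complete all
   operations first, then disconnect, which itself waits for pending
   internal communication:
       MPI_Wait(&req, MPI_STATUS_IGNORE);
       MPI_Comm_disconnect(&comm);
*/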

OTOH, I’m not sure that doing this buys you as much as you think it might.

MPI_COMM_FREE is a collective procedure, so it is permitted to wait until 
MPI_COMM_FREE has been called at all other MPI processes in the communicator, 
i.e. it can have blocking-barrier-like semantics. All collective operations 
must be initialised in the same order at all processes in the communicator. So 
a valid implementation could do all the pending work inside MPI_COMM_FREE but 
the Standard also permits an implementation that does nothing other than change 
a “ready-for-freeing” flag on the local communicator object.

Am I allowed to call MPI_COMM_FREE while I have an uncompleted request for a 
nonblocking collective operation (like MPI_IBARRIER) on the communicator?

Yes.

  Will MPI_COMM_FREE block for completion of the NBC op?

No.

Cheers,
Dan.
—
Dr Daniel Holmes PhD
Architect (HPC Research)
d.hol...@epcc.ed.ac.uk
Phone: +44 (0) 131 651 3465
Mobile: +44 (0) 7940 524 088
Address: Room 2.09, Bayes Centre, 47 Potterrow, Central Area, Edinburgh, EH8 9BT
—
The University of Edinburgh is a charitable body, registered in Scotland, with 
registration number SC005336.
—

On 21 Aug 2020, at 15:26, Quincey Koziol via mpi-forum
<mpi-forum@lists.mpi-forum.org> wrote:

Hi Dan,
I agree with you about MPI barriers, but that's why I said it was simplified
pseudocode.  :-)   We do have more mechanisms in place for handling the
“fence-ness” of the operation, but barriers are a component and I'd like to
move to a nonblocking version when possible.

Having more (hopefully all) of the file operations gain equivalent nonblocking
versions would be _very_ nice, and I could simplify our internal code further
if a nonblocking MPI_FILE_SYNC were available.  A nonblocking version of
MPI_FILE_SET_SIZE would also be high on my list.

Yes, I grok the behavior of MPI_FILE_CLOSE, but don’t want to add a barrier on 
top of it. :-)


One new question:  Am I allowed to call MPI_COMM_FREE while I have an 
uncompleted request for a nonblocking collective operation (like MPI_IBARRIER) 
on the communicator?  Will MPI_COMM_FREE block for completion of the NBC op?


Thanks!
Quincey



On Aug 15, 2020, at 6:07 AM, HOLMES Daniel <d.hol...@epcc.ed.ac.uk> wrote:

Hi Quincey,

The MPI barrier operation (whether blocking, nonblocking, or persistent) does 
not guarantee “memory fence” semantics (either for the content of memory or the 
content of files).

Perhaps you are looking for MPI_FILE_SYNC?

"If other processes have made updates to the storage device, then all such 
updates become visible to subsequent reads of fh by the calling process.” 
§13.6.1

"MPI_FILE_SYNC is a collective operation.” §13.6.1

Used correctly (the user must locally complete their I/O operations before
calling it), this does provide a “fence”-like guarantee *for the file*, which
is what your code appears to be attempting. That is, all writes to the file
that were initiated remotely (and locally completed at the remote process)
before the matching remote call to MPI_FILE_SYNC are guaranteed to be visible
in the file to subsequent locally issued MPI read operations once the local
call to MPI_FILE_SYNC completes locally.
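
For example, a minimal sketch of the usual sync-barrier-sync recipe (rank 0
writes, rank 1 reads; rank, fh, comm, buf, and count are assumed to be set up
already; error checks omitted):

if (rank == 0)
    MPI_File_write_at(fh, 0, buf, count, MPI_INT, MPI_STATUS_IGNORE);

MPI_File_sync(fh);   /* flush locally completed writes; collective */
MPI_Barrier(comm);   /* order the writes-before against the reads-after */
MPI_File_sync(fh);   /* make remote updates visible to this process */

if (rank == 1)
    MPI_File_read_at(fh, 0, buf, count, MPI_INT, MPI_STATUS_IGNORE);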

There is currently no nonblocking or persistent expression of this MPI 
procedure - watch this space: this is on the to-do list for MPI-Next.

As Jim points out, the performance problem you note is most likely due to the 
implicit MPI_FILE_SYNC-like synchronisation done internally by MPI during the 
MPI_FILE_CLOSE procedure call. All enqueued file operations targeting the file 
will be flushed to the file during MPI_FILE_CLOSE. If file operations are not 
flushed to the file concurrently with the application stuff or the MPI 
communication operations, then they will still be enqueued when MPI_FILE_CLOSE 
is called.

Cheers,
Dan.
—
Dr Daniel Holmes PhD
Architect (HPC Research)
d.hol...@epcc.ed.ac.uk<mailto:d.hol...@epcc.ed.ac.uk>
Phone: +44 (0) 131 651 3465
Mobile: +44 (0) 7940 524 088
Address: Room 2.09, Bayes Centre, 47 Potterrow, Central Area, Edinburgh, EH8 9BT
—
The University of Edinburgh is a charitable body, registered in Scotland, with 
registration number SC005336.
—

On 14 Aug 2020, at 17:32, Quincey Koziol via mpi-forum
<mpi-forum@lists.mpi-forum.org> wrote:

Hi Dan,
I believe that Pavan was referring to my conversation with him about 
MPI_Request_free.  Here’s my situation: I’d like to use MPI_Ibarrier as a form 
of “memory fence” between some of the metadata reads and writes in HDF5.   
Here’s some [very] simplified pseudocode for what I’d like to do:

===============================

/* <open HDF5 file>: sets up a communicator (file_comm) for internal
   HDF5 communication about this file */

do {
    MPI_Ibarrier(file_comm, &request);

    /* <application stuff> */

    /* HDF5 operation: */
    if (/* operation is read or write */) {
        MPI_Wait(&request, MPI_STATUS_IGNORE);
        /* <perform read / write> */
    }
    else {  /* operation is a file close */
        MPI_Request_free(&request);
        MPI_File_close(&fh);
        MPI_Comm_free(&file_comm);
    }
} while (/* file is open */);

===============================

What I am really trying to avoid is calling MPI_Wait at file close, since it is 
semantically unnecessary and only increases the latency from the application’s 
perspective.   If I can’t call MPI_Request_free on a nonblocking collective 
operation’s request (and it looks like I can’t, right now), I will have to put 
the request and file’s communicator into a “cleanup” list that is polled 
periodically [on each rank] with MPI_Test and disposed of when the nonblocking 
barrier completes locally.
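
That fallback might look something like this (a sketch only; the names are
illustrative, not HDF5's actual internals):

#include <stdlib.h>
#include <mpi.h>

typedef struct cleanup_entry {
    MPI_Request           req;    /* the nonblocking barrier's request */
    MPI_Comm              comm;   /* the file's communicator */
    struct cleanup_entry *next;
} cleanup_entry;

/* Called periodically on each rank. */
void poll_cleanup_list(cleanup_entry **head)
{
    cleanup_entry **cur = head;
    while (*cur) {
        int done = 0;
        MPI_Test(&(*cur)->req, &done, MPI_STATUS_IGNORE);
        if (done) {                    /* barrier completed locally */
            cleanup_entry *e = *cur;
            MPI_Comm_free(&e->comm);   /* now safe to dispose of */
            *cur = e->next;
            free(e);
        } else {
            cur = &(*cur)->next;
        }
    }
}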

So, I’d really like to be able to call MPI_Request_free on the nonblocking 
barrier’s request.

Thoughts?

Quincey


On Aug 13, 2020, at 9:07 AM, HOLMES Daniel via mpi-forum
<mpi-forum@lists.mpi-forum.org> wrote:

Hi Jim,

To be clear, I think that MPI_CANCEL is evil and should be removed from the MPI 
Standard entirely at the earliest convenience.

I am certainly not arguing that it be permitted for more MPI operations.

I thought the discussion was focused on MPI_REQUEST_FREE and whether or not it 
can/should be used on an active request.

If a particular MPI implementation does not keep a reference to the request 
between MPI_RPUT and MPI_REQUEST_FREE, but needs that reference to process the 
completion event, then that MPI implementation would be required to keep a 
reference to the request from MPI_REQUEST_FREE until that important task had 
been done, perhaps until the close epoch call. This requires no new memory 
because the user is giving up their reference to the request, so MPI can safely 
use the request it is passed in MPI_REQUEST_FREE without copying it. As you 
say, MPI takes over the responsibility for processing the completion event.
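
One plausible internal scheme (purely illustrative, not any real
implementation's code) is reference counting the request object:

typedef struct req_obj {
    int refcount;   /* one reference for the user handle, one for the
                       pending completion event */
    /* ... operation state ... */
} req_obj;

static void req_release(req_obj *r)
{
    if (--r->refcount == 0)
        free(r);    /* the freeing stage runs only after both the user
                       and the progress engine have let go */
}

/* MPI_REQUEST_FREE calls req_release() to drop the user's reference;
   the progress engine calls it again once it has processed the
   completion event, perhaps as late as the close-epoch call. */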

Your question about why the implementation should be required to take on this 
complexity is a good one. That, I guess, is why freeing any active request is a 
bad idea. MPI is required to differentiate completion of individual operations 
(so it can implement MPI_WAIT) but that means something must process completion 
at some point for each individual operation. In RMA, that responsibility can be 
discharged earlier than in other parts of the MPI interface, but the real 
question is “why should MPI offer to take on this responsibility in the first 
place?”

Thanks, that helps (me at least).

Cheers,
Dan.

On 13 Aug 2020, at 14:43, Jim Dinan <james.di...@gmail.com> wrote:

The two cases you mentioned would have the same behavior at an application 
level. However, there may be important differences in the implementation of 
each operation. For example, an MPI_Put operation may be configured to not 
generate a completion event, whereas an MPI_Rput would. The library may be 
relying on the user to make a call on the request to process the event and 
clean up resources. The implementation can take over this responsibility if the 
user cancels the request, but why should we ask implementers to take on this 
complexity and overhead?

My $0.02 is that MPI_Cancel is subtle and complicated, and we should be very 
careful about where we allow it. I don't see the benefit to the programming 
model outweighing the complexity and overhead in the MPI runtime for the case 
of MPI_Rput. I also don't know that we were careful enough in specifying the 
RMA memory model that a canceled request-based RMA operation will still have 
well-defined behavior. My understanding is that MPI_Cancel is required 
primarily for canceling receive requests to meet MPI's quiescent shutdown 
requirement.

 ~Jim.

On Thu, Aug 13, 2020 at 8:11 AM HOLMES Daniel via mpi-forum
<mpi-forum@lists.mpi-forum.org> wrote:
Hi all,

To increase my own understanding of RMA, what is the difference (if any)
between a request-based RMA operation whose request is freed without being
completed and before the epoch is closed, and a “normal” RMA operation?

MPI_WIN_LOCK() ! or any other "open epoch at origin" procedure call
doUserWorkBefore()
MPI_RPUT(&req)
MPI_REQUEST_FREE(&req)
doUserWorkAfter()
MPI_WIN_UNLOCK() ! or the matching "close epoch at origin" procedure call

vs:

MPI_WIN_LOCK() ! or any other "open epoch at origin" procedure call
doUserWorkBefore()
MPI_PUT()
doUserWorkAfter()
MPI_WIN_UNLOCK() ! or the matching "close epoch at origin" procedure call

Is this a source-to-source translation that is always safe in either direction?

In RMA, in contrast to the rest of MPI, there are two opportunities for MPI to 
“block” and do non-local work to complete an RMA operation: 1) during MPI_WAIT 
for the request (if any - the user may not be given a request or the user may 
choose to free the request without calling MPI_WAIT or the user might call 
nonblocking MPI_TEST) and 2) during the close epoch procedure, which is always 
permitted to be sufficiently non-local to guarantee that the RMA operation is 
complete and its freeing stage has been done. It seems that a request-based RMA 
operation becomes identical to a “normal” RMA operation if the user calls 
MPI_REQUEST_FREE on the request. This is like “freeing" the request from a 
nonblocking point-to-point operation but without the guarantee of a later 
synchronisation procedure that can actually complete the operation and actually 
do the freeing stage of the operation.

In collectives, there is no “ensure all operations so far are now done” 
procedure call because there is no concept of epoch for collectives.
In point-to-point, there is no “ensure all operations so far are now done” 
procedure call because there is no concept of epoch for point-to-point.
In file operations, there is no “ensure all operations so far are now done” 
procedure call because there is no concept of epoch for file operations. (There 
is MPI_FILE_SYNC but it is optional so MPI cannot rely on it being called.)
In these cases, the only non-local procedure that is guaranteed to happen is 
MPI_FINALIZE, hence all outstanding non-local work needed by the “freed” 
operation might be delayed until that procedure is called.

The issue with copying parameters is also moot because all of them are 
passed-by-value (implicitly copied) or are data-buffers and covered by 
“conflicting accesses” RMA rules.

Thus, it seems to me that RMA is a very special case - it could support
different semantics, but that does not provide a good basis for claiming that 
the rest of the MPI Standard can support those different semantics - unless we 
introduce an epoch concept into the rest of the MPI Standard. This is not 
unreasonable: the notifications in GASPI, for example, guarantee completion of 
not just the operation they are attached to but *all* operations issued in the 
“queue” they represent since the last notification. Their queue concept serves 
the purpose of an epoch. I’m sure there are other examples in other APIs. It 
seems to me likely that the proposal for MPI_PSYNC for partitioned 
communication operations is moving in the direction of an epoch, although 
limited to remote completion of all the partitions in a single operation, which 
accidentally guarantees that the operation can be freed locally using a local 
procedure.

Cheers,
Dan.

On 13 Aug 2020, at 01:40, Skjellum, Anthony via mpi-forum
<mpi-forum@lists.mpi-forum.org> wrote:

FYI, one argument (also used to justify the restriction that MPI persistent
collective initialization be blocking): MPI_Request_free on an NBC poses a
problem for operations that take array arguments (e.g., Alltoallv/w). The
application cannot know whether those vectors are still in use by MPI after
the free on an active request. We do *not* currently mandate that the MPI
implementation copy such arrays, so they are effectively "held as unfreeable"
by the MPI implementation till MPI_Finalize. The user cannot deallocate them
in a correct program till after MPI_Finalize.

Another effect of releasing an active NBC request, IMHO, is that you don't
know when send or receive buffers can safely be deallocated, since you don't
know when the transfer is complete or when MPI has stopped using the buffers
(till after MPI_Finalize). A sketch of the hazard is below.
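
(Hypothetical, since freeing an active NBC request is currently erroneous;
sbuf, rbuf, comm, and nprocs are assumed to be set up already.)

int *scounts = malloc(nprocs * sizeof(int));  /* likewise sdispls,
                                                 rcounts, rdispls */
/* ... fill counts and displacements ... */
MPI_Request req;
MPI_Ialltoallv(sbuf, scounts, sdispls, MPI_INT,
               rbuf, rcounts, rdispls, MPI_INT, comm, &req);
MPI_Request_free(&req);   /* hypothetically permitted */

/* With the request gone there is no way to learn when the operation
   completes, so the count/displacement arrays and both buffers must be
   assumed in use by MPI... */
free(scounts);            /* ...making this erroneous until after
                             MPI_Finalize */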

Tony




Anthony Skjellum, PhD
Professor of Computer Science and Chair of Excellence
Director, SimCenter
University of Tennessee at Chattanooga (UTC)
tony-skjel...@utc.edu  [or skjel...@gmail.com]
cell: 205-807-4968

________________________________
From: mpi-forum <mpi-forum-boun...@lists.mpi-forum.org> on behalf of Jeff
Hammond via mpi-forum <mpi-forum@lists.mpi-forum.org>
Sent: Saturday, August 8, 2020 12:07 PM
To: Main MPI Forum mailing list <mpi-forum@lists.mpi-forum.org>
Cc: Jeff Hammond <jeff.scie...@gmail.com>
Subject: Re: [Mpi-forum] MPI_Request_free restrictions

We should fix the RMA chapter with an erratum. I care less about NBC but share 
your ignorance of why it was done that way.

Sent from my iPhone

On Aug 8, 2020, at 6:51 AM, Balaji, Pavan via mpi-forum
<mpi-forum@lists.mpi-forum.org> wrote:

Folks,

Does someone remember why we disallowed users from calling MPI_Request_free on 
nonblocking collective requests?  I remember the reasoning for not allowing 
cancel (i.e., the operation might have completed on some processes, but not 
all), but not for Request_free.  AFAICT, allowing the users to free the request 
doesn’t make any difference to the MPI library.  The MPI library would simply 
maintain its own refcount to the request and continue forward till the 
operation completes.  One of our users would like to free NBC requests so they 
don’t have to wait for the operation to complete in some situations.
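
In other words, they want to be able to write something like this sketch
(erroneous under MPI-3.1, per the quote below; comm assumed set up):

MPI_Request req;
MPI_Ibarrier(comm, &req);
MPI_Request_free(&req);   /* erroneous today for an NBC request; the
                             library would keep its own refcount and
                             complete the barrier internally */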

Unfortunately, when I added the Rput/Rget operations in the RMA chapter, I 
copy-pasted that text into RMA as well without thinking too hard about it.  My 
bad!  Either the RMA committee missed it too, or they thought of a reason that 
I can’t think of now.

Can someone clarify or remind me what the reason was?

Regards,

  — Pavan

MPI-3.1 standard, page 197, lines 26-27:

“It is erroneous to call MPI_REQUEST_FREE or MPI_CANCEL for a request 
associated with a nonblocking collective operation.”
