Brian,

My comment was mainly about the BTL code. MPI_Win_fence does not require
remote completion: the call only guarantees that all outbound operations
have been locally completed, and that all inbound operations from other
sources on the process are also complete. I agree with you on the Win_flush
implementation we have: it only guarantees the first part, and assumes the
barrier will drain the network of all pending messages.

You're right: the current implementation assumes that MPI_Barrier, having
a more synchronizing behavior and requiring more messages to be exchanged
between the participants, increases the likelihood that, even with
overtaking, all pending messages have reached their destination.

  George.


On Tue, Sep 28, 2021 at 10:36 PM Barrett, Brian <bbarr...@amazon.com> wrote:

> George –
>
>
>
> Is your comment about the code path referring to the BTL code or the OSC
> RDMA code?  The OSC code seems to expect remote completion, at least for
> the fence operation.  Fence is implemented as a btl flush followed by a
> window-wide barrier.  There’s no ordering specified between the RDMA
> operations completed by the flush and the send messages in the collective,
> so overtaking is possible.  Given that the BTL and the UCX PML (or OFI MTL
> or whatever) are likely using different QPs, ordering of the packets is
> doubtful.
>
>
>
> Like you, we saw that many BTLs appear to only guarantee local completion
> with flush().  So the question is which one is broken (and then we’ll have
> to figure out how to fix…).
>
>
>
> Brian
>
>
>
> On 9/28/21, 7:11 PM, "devel on behalf of George Bosilca via devel" <
> devel-boun...@lists.open-mpi.org on behalf of devel@lists.open-mpi.org>
> wrote:
>
>
>
>
>
> Based on my high-level understanding of the code path and according to the
> UCX implementation of the flush, the required level of completion is local.
>
>
>
>   George.
>
>
>
>
>
> On Tue, Sep 28, 2021 at 19:26 Zhang, Wei via devel <
> devel@lists.open-mpi.org> wrote:
>
> Dear All,
>
>
>
> I have a question regarding the completion semantics of btl_flush.
>
>
>
> In opal/mca/btl/btl.h,
>
>
>
>
> https://github.com/open-mpi/ompi/blob/4828663537e952e3d7cbf8fbf5359f16fdcaaade/opal/mca/btl/btl.h#L1146
>
>
>
> the comment about btl_flush says:
>
>
>
> * This function returns when all outstanding RDMA (put, get, atomic) operations
> * that were started prior to the flush call have completed.
>
>
>
> However, it is not clear to me what “complete” actually means. E.g. does
> it mean local completion (the action on the RDMA initiator side has
> completed), or does it mean remote completion (the action on the RDMA
> remote side has completed)? We are interested in this because for many
> RDMA btls, local completion is not the same as remote completion.
>
>
>
> From the way btl_flush is used in osc/rdma’s fence operation (which is a
> call to flush followed by an MPI_Barrier), we think that btl_flush should
> mean remote completion, but we would like clarification from the
> community.
>
>
>
> Sincerely,
>
>
>
> Wei Zhang
>
>
>
>
